This project investigates the relationship between county-level broadband connectivity and social vulnerability across the United States.
Data Year Configuration: This analysis can use
either December 2020 or May 2025 FCC broadband data. Set
USE_FCC_2025 <- TRUE (default) or FALSE in
the setup chunk above.
We use:
FCC Broadband Deployment Data (BDC) County-level broadband coverage data. Configurable to use either December 2020 or May 2025 data. The 2025 data includes continuous coverage fractions at multiple speed tiers.
CDC/ATSDR Social Vulnerability Index (SVI) 2020 – County Level
Provides composite social vulnerability percentiles and four theme-level
indices:
socioeconomic status, household composition, minority status, and
housing/transportation.
U.S. Census / ACS Housing Units (embedded in FCC data) Used as denominators for broadband connection rates.
Geographic FIPS Codes Standard 5-digit county FIPS codes support merging across all datasets.
This project follows a fully reproducible workflow using R Markdown, where all data cleaning, merging, analysis, and visualizations are generated inside this document. # 1.1 Data Inventory Table
Dataset Name : CDC/ATSDR Social Vulnerability Index (SVI) Source / URL : https://www.atsdr.cdc.gov/placeandhealth/svi/data_documentation_download.html Year Used : 2020 Geographic Level: County (FIPS) Format: CSV Files Used: SVI2020_US_COUNTY.csv Size: ~5–6 MB Key Variables: County FIPS, overall SVI percentile, SVI Theme scores (Socioeconomic, Household Composition, Minority Status, Housing/Transportation) Data Quality Notes: Percentile values range 0–1; no SVI data for Puerto Rico; a few suppressed values for remote Alaska counties.
FCC Broadband Deployment Data (December 2020) Dataset Name : FCC Form 477 Fixed Broadband Deployment Summary Source / URL : FCC Broadband Deployment Data Year Used: Dec 2020 Geographic Level: County Format: CSV Files Used: fcc_dec_2020_county.csv (cleaned version) Size: ~20–30 MB Key Variables: FIPS, availability of ≥25/3 Mbps broadband, provider counts, technology codes Data Quality Notes: Known issues include over-reporting availability; no coverage for Puerto Rico or territories.
Microsoft Airband Broadband Usage
Dataset Name: Microsoft Airband “US Broadband Usage” Source / URL: https://github.com/microsoft/USBroadbandUsage Year Used: 2020 Geographic Level : County Format : CSV Files Used : airband_2020_county.csv Size: ~2 MB Key Variables : fips, measured broadband usage, observed availability Data Quality Notes : Usage values are decimals (0–1). No PR or territories.
Dataset Name : ACS 5-Year Estimates — B28002 Source / URL : data.census.gov Year Used : 2020 Geographic Level : County Format : CSV Files Used : ACSDT5Y2020.B28002-Data.csv Size : ~4 MB Key Variables : Total households, households with broadband, households with no internet access Data Quality Notes : Minor MOE columns removed; numeric conversion required.
Dataset Name : ACS 5-Year Estimates — B19013 Source / URL : data.census.gov Year Used : 2020 Geographic Level : County Format : CSV Files Used : ACSDT5Y2020.B19013-Data.csv Size : ~3 MB Key Variables : Median household income Data Quality Notes : No PR values; income suppressed for some remote Alaska regions.
Dataset Name : ACS 5-Year Estimates — B15003 Source / URL : data.census.gov Year Used : 2020 Geographic Level : County Format : CSV Files Used : ACSDT5Y2020.B15003-Data.csv Size : ~4 MB Key Variables : Education levels (HS, bachelor’s, master’s, doctorate) Data Quality Notes :Consistent coverage except PR and some small Alaska areas.
Dataset Name : TIGER/Line 2020 County Shapefile Source / URL : https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html Year Used : 2020 Geographic Level : County (MULTIPOLYGON) Format : ESRI Shapefile (.shp, .dbf, .shx, .prj) Files Used : tl_2020_us_county.shp + associated files Size : ~50–100 MB Key Variables : GEOID, county name, state FIPS, geometry Data Quality Notes : Includes 3,234 county-equivalents → includes Puerto Rico, territories, and Alaska census areas. Must be filtered before analysis.
In this section, we work with the FCC Form 477 county-level broadband
tier dataset.
The raw file contains semiannual observations (June and December) for
each county from 2016–2024.
FCC Form 477 includes both deployment (availability) and subscription (adoption) datasets. The deployment dataset — used in this project — measures broadband availability across speed tiers (Tier 1–4) and provides a county-level snapshot of the infrastructure that is available to residents. This is distinct from the subscription dataset, which reflects how many households actually subscribe to broadband. Because our project examines disparities in access, we chosse the deployment dataset.
# Load the raw FCC county tier data file
fcc_raw <- read_csv(
path_raw("fcc", "form477_fcc_data.csv")
)
# View column names and structure
names(fcc_raw)
## [1] "Year" "Month" "FIPS" "State"
## [5] "County" "State_Name" "County_Name" "Housing_Units"
## [9] "Tier_1" "Tier_2" "Tier_3" "Tier_4"
glimpse(fcc_raw)
## Rows: 67,926
## Columns: 12
## $ Year <dbl> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 20…
## $ Month <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
## $ FIPS <dbl> 1001, 1003, 1005, 1007, 1009, 1011, 1013, 1015, 1017, 10…
## $ State <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ County <dbl> 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 3…
## $ State_Name <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "…
## $ County_Name <chr> "Autauga County", "Baldwin County", "Barbour County", "B…
## $ Housing_Units <dbl> 25318, 135669, 11816, 9208, 25065, 4616, 9911, 53540, 16…
## $ Tier_1 <dbl> 5, 4, 3, 3, 4, 3, 4, 4, 4, 3, 4, 3, 3, 3, 3, 4, 4, 3, 4,…
## $ Tier_2 <dbl> 5, 4, 3, 3, 3, 3, 4, 4, 4, 3, 4, 2, 3, 3, 3, 4, 4, 2, 4,…
## $ Tier_3 <dbl> 5, 4, 3, 2, 3, 3, 3, 4, 4, 2, 3, 2, 2, 2, 2, 4, 4, 2, 4,…
## $ Tier_4 <dbl> 4, 3, 3, 2, 2, 3, 2, 3, 4, 1, 2, 1, 1, 2, 2, 4, 3, 1, 3,…
fcc_2020_dec <- fcc_raw %>%
rename_with(tolower) %>%
filter(year == 2020, month == 12) %>%
mutate(
county_fips = str_pad(as.character(fips), 5, side = "left", pad = "0")
) %>%
select(
county_fips,
state_name,
county_name,
housing_units,
tier1 = tier_1,
tier2 = tier_2,
tier3 = tier_3
)
head(fcc_2020_dec)
## # A tibble: 6 × 7
## county_fips state_name county_name housing_units tier1 tier2 tier3
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01001 Alabama Autauga County 24165 4 4 4
## 2 01003 Alabama Baldwin County 122518 4 3 3
## 3 01005 Alabama Barbour County 12120 3 3 2
## 4 01007 Alabama Bibb County 9340 3 2 1
## 5 01009 Alabama Blount County 24625 3 3 2
## 6 01011 Alabama Bullock County 4625 3 3 1
summary(fcc_2020_dec)
## county_fips state_name county_name housing_units
## Length:3234 Length:3234 Length:3234 Min. : 0
## Class :character Class :character Class :character 1st Qu.: 5620
## Mode :character Mode :character Mode :character Median : 12697
## Mean : 44079
## 3rd Qu.: 31556
## Max. :3599420
## tier1 tier2 tier3
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:2.000
## Median :4.000 Median :3.000 Median :3.000
## Mean :3.681 Mean :3.272 Mean :2.717
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000
# If not already in memory, load the non-spatial master dataset
master_2020 <- readRDS("processed_data/master_2020_county.rds")
broadband_check <- master_2020 %>%
select(
FIPS,
housing_units,
tier1, tier2, tier3,
airband_fcc_availability,
airband_usage
)
# 1) Range checks for FCC/Airband measures
summary(broadband_check)
## FIPS housing_units tier1 tier2
## Length:3141 Min. : 221 Min. :4 Min. :1.000
## Class :character 1st Qu.: 6856 1st Qu.:4 1st Qu.:2.000
## Mode :character Median : 15173 Median :4 Median :4.000
## Mean : 50430 Mean :4 Mean :3.104
## 3rd Qu.: 36778 3rd Qu.:4 3rd Qu.:4.000
## Max. :3868643 Max. :4 Max. :4.000
##
## tier3 airband_fcc_availability airband_usage
## Min. :1.000 Min. :0.0002 Min. :0.0010
## 1st Qu.:2.000 1st Qu.:0.7698 1st Qu.:0.1960
## Median :2.000 Median :0.9244 Median :0.3680
## Mean :2.499 Mean :0.8404 Mean :0.3896
## 3rd Qu.:3.000 3rd Qu.:0.9839 3rd Qu.:0.5660
## Max. :4.000 Max. :1.0000 Max. :1.0000
## NA's :9 NA's :11
# Check that availability and usage are between 0 and 1
invalid_range <- broadband_check %>%
filter(
airband_fcc_availability < 0 | airband_fcc_availability > 1 |
airband_usage < 0 | airband_usage > 1
)
invalid_range # ideally 0 rows
## # A tibble: 0 × 7
## # ℹ 7 variables: FIPS <chr>, housing_units <dbl>, tier1 <int>, tier2 <int>,
## # tier3 <int>, airband_fcc_availability <dbl>, airband_usage <dbl>
# Check that tiers (if present) are within 1–4
invalid_tiers <- broadband_check %>%
filter(
(!is.na(tier1) & (tier1 < 1 | tier1 > 4)) |
(!is.na(tier2) & (tier2 < 1 | tier2 > 4)) |
(!is.na(tier3) & (tier3 < 1 | tier3 > 4))
)
invalid_tiers # ideally 0 rows
## # A tibble: 0 × 7
## # ℹ 7 variables: FIPS <chr>, housing_units <dbl>, tier1 <int>, tier2 <int>,
## # tier3 <int>, airband_fcc_availability <dbl>, airband_usage <dbl>
# 2) Missingness in key broadband fields
colSums(is.na(broadband_check))
## FIPS housing_units tier1
## 0 0 0
## tier2 tier3 airband_fcc_availability
## 0 0 9
## airband_usage
## 11
# 3) Simple consistency check: usage should not greatly exceed availability
usage_gt_avail <- broadband_check %>%
filter(
!is.na(airband_fcc_availability),
!is.na(airband_usage),
airband_usage > airband_fcc_availability
)
usage_gt_avail %>% head()
## # A tibble: 6 × 7
## FIPS housing_units tier1 tier2 tier3 airband_fcc_availability airband_usage
## <chr> <dbl> <int> <int> <int> <dbl> <dbl>
## 1 01063 5979 4 1 1 0.0085 0.013
## 2 01105 5751 4 1 1 0.0012 0.032
## 3 02185 3534 4 4 1 0.0061 0.051
## 4 02188 3329 4 4 1 0.0081 0.039
## 5 02198 4267 4 4 1 0.0282 0.04
## 6 02290 5467 4 4 1 0.013 0.139
nrow(usage_gt_avail)
## [1] 50
FCC Data Validation Summary
After cleaning the FCC December 2020 county-level dataset, we verified the quality of all key variables before merging them into the master file. All FCC tiers (tier1, tier2, tier3) fall within the correct range of 1–4 with no invalid values detected. Housing-unit counts are strictly positive for all counties, and no negative, zero, or corrupted values were found. The FCC dataset therefore passes internal consistency checks and is suitable for downstream analysis.
write_csv(fcc_2020_dec, path_processed("fcc_2020_dec_clean.csv"))
saveRDS(fcc_2020_dec, path_processed("fcc_2020_dec_clean.rds"))
In this section, we combine the cleaned FCC broadband dataset (December 2020) with the cleaned SVI 2020 county dataset using 5-digit county FIPS codes.
fcc_2020 <- readRDS(path_processed(fcc_file)) # Uses configured FCC year
svi_2020 <- readRDS(path_processed("svi_2020_county_clean.rds"))
cat("Loaded FCC data:", fcc_year_label, "-", nrow(fcc_2020), "counties\n")
## Loaded FCC data: May 2025 - 3232 counties
head(fcc_2020)
## # A tibble: 6 × 11
## county_fips state_name county_name housing_units tier1 tier2 tier3
## <chr> <chr> <chr> <dbl> <int> <int> <int>
## 1 01001 AL Autauga County 27544 4 4 4
## 2 01003 AL Baldwin County 137072 4 2 2
## 3 01005 AL Barbour County 15445 4 2 2
## 4 01007 AL Bibb County 11168 4 1 1
## 5 01009 AL Blount County 31265 4 1 1
## 6 01011 AL Bullock County 6136 4 3 1
## # ℹ 4 more variables: coverage_25_3 <dbl>, coverage_100_20 <dbl>,
## # coverage_250_25 <dbl>, coverage_1000_100 <dbl>
head(svi_2020)
## # A tibble: 6 × 8
## county_fips state county svi_overall svi_soc svi_hh svi_min svi_hous
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01001 AL Autauga 0.513 0.384 0.736 0.634 0.431
## 2 01003 AL Baldwin 0.310 0.325 0.272 0.502 0.361
## 3 01005 AL Barbour 0.993 0.957 0.945 0.896 0.995
## 4 01007 AL Bibb 0.808 0.801 0.551 0.629 0.862
## 5 01009 AL Blount 0.514 0.576 0.722 0.415 0.274
## 6 01011 AL Bullock 0.831 0.718 0.652 0.981 0.850
missing_svi <- fcc_2020 %>%
anti_join(svi_2020, by = "county_fips")
head(missing_svi)
## # A tibble: 6 × 11
## county_fips state_name county_name housing_units tier1 tier2 tier3
## <chr> <chr> <chr> <dbl> <int> <int> <int>
## 1 60010 AS Eastern District 3246 4 1 1
## 2 60020 AS Manu'a District 419 4 1 1
## 3 60040 AS Swains Island 1 1 1 1
## 4 60050 AS Western District 5673 4 1 1
## 5 66010 GU Guam 37572 4 3 1
## 6 69100 MP Rota Municipality 868 4 4 1
## # ℹ 4 more variables: coverage_25_3 <dbl>, coverage_100_20 <dbl>,
## # coverage_250_25 <dbl>, coverage_1000_100 <dbl>
nrow(missing_svi)
## [1] 89
The missing_svi table shows FCC counties and territories that appear in the broadband data but do not have corresponding SVI 2020 county records. These are primarily U.S. territories such as Puerto Rico, Guam, American Samoa, and the U.S. Virgin Islands. Because SVI is not available for these areas, they will be excluded from our final merged dataset. ## 4.3 Create Final Merged Dataset (FCC + SVI)
merged_data <- fcc_2020 %>%
inner_join(svi_2020, by = "county_fips")
# Inspect merged dataset
nrow(merged_data)
## [1] 3143
head(merged_data)
## # A tibble: 6 × 18
## county_fips state_name county_name housing_units tier1 tier2 tier3
## <chr> <chr> <chr> <dbl> <int> <int> <int>
## 1 01001 AL Autauga County 27544 4 4 4
## 2 01003 AL Baldwin County 137072 4 2 2
## 3 01005 AL Barbour County 15445 4 2 2
## 4 01007 AL Bibb County 11168 4 1 1
## 5 01009 AL Blount County 31265 4 1 1
## 6 01011 AL Bullock County 6136 4 3 1
## # ℹ 11 more variables: coverage_25_3 <dbl>, coverage_100_20 <dbl>,
## # coverage_250_25 <dbl>, coverage_1000_100 <dbl>, state <chr>, county <chr>,
## # svi_overall <dbl>, svi_soc <dbl>, svi_hh <dbl>, svi_min <dbl>,
## # svi_hous <dbl>
summary(merged_data)
## county_fips state_name county_name housing_units
## Length:3143 Length:3143 Length:3143 Min. : 221
## Class :character Class :character Class :character 1st Qu.: 6847
## Mode :character Mode :character Mode :character Median : 15075
## Mean : 50401
## 3rd Qu.: 36756
## Max. :3868643
## tier1 tier2 tier3 coverage_25_3 coverage_100_20
## Min. :4 Min. :1.000 Min. :1.000 Min. :0.9996 Min. :0.0000
## 1st Qu.:4 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:1.0000 1st Qu.:0.7074
## Median :4 Median :4.000 Median :2.000 Median :1.0000 Median :0.9023
## Mean :4 Mean :3.104 Mean :2.498 Mean :1.0000 Mean :0.8143
## 3rd Qu.:4 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:0.9911
## Max. :4 Max. :4.000 Max. :4.000 Max. :1.0000 Max. :1.0000
## coverage_250_25 coverage_1000_100 state county
## Min. :0.0000 Min. :0.0000 Length:3143 Length:3143
## 1st Qu.:0.5050 1st Qu.:0.0476 Class :character Class :character
## Median :0.7453 Median :0.2287 Mode :character Mode :character
## Mean :0.6748 Mean :0.3071
## 3rd Qu.:0.8991 3rd Qu.:0.4974
## Max. :1.0000 Max. :1.0000
## svi_overall svi_soc svi_hh svi_min
## Min. :0.0000 Min. :0.00 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2500 1st Qu.:0.25 1st Qu.:0.2500 1st Qu.:0.2454
## Median :0.4997 Median :0.50 Median :0.4997 Median :0.4981
## Mean :0.5000 Mean :0.50 Mean :0.5000 Mean :0.4988
## 3rd Qu.:0.7500 3rd Qu.:0.75 3rd Qu.:0.7500 3rd Qu.:0.7495
## Max. :1.0000 Max. :1.00 Max. :1.0000 Max. :1.0000
## svi_hous
## Min. :0.0000
## 1st Qu.:0.2500
## Median :0.5000
## Mean :0.5000
## 3rd Qu.:0.7499
## Max. :1.0000
write_csv(merged_data, path_processed("merged_fcc_svi_2020.csv"))
saveRDS(merged_data, path_processed("merged_fcc_svi_2020.rds"))
Microsoft’s Airband dataset provides estimates of actual broadband usage based on Windows device telemetry. While FCC data describes where broadband service is available (deployment), Airband data describes where broadband is actually used (adoption). Together, these allow us to compare access (FCC) and usage (Airband) across U.S. counties.
airband_raw <- read_csv(
path_raw("microsoft", "broadband_data_2020.csv")
)
names(airband_raw)
## [1] "ST" "COUNTY ID"
## [3] "COUNTY NAME" "BROADBAND AVAILABILITY PER FCC"
## [5] "BROADBAND USAGE"
glimpse(airband_raw)
## Rows: 3,142
## Columns: 5
## $ ST <chr> "AL", "AL", "AL", "AL", "AL", "AL", "…
## $ `COUNTY ID` <dbl> 1001, 1003, 1005, 1007, 1009, 1011, 1…
## $ `COUNTY NAME` <chr> "Autauga County", "Baldwin County", "…
## $ `BROADBAND AVAILABILITY PER FCC` <chr> "0.8057", "0.8362", "0.6891", "0.3368…
## $ `BROADBAND USAGE` <chr> "0.391", "0.452", "0.324", "0.136", "…
airband_clean <- airband_raw %>%
# Standardize column names
rename_with(~ .x |>
str_trim() |>
str_replace_all("\\s+", "_") |>
str_to_lower()) %>%
# Now expect columns:
# st, county_id, county_name, broadband_availability_per_fcc, broadband_usage
mutate(
county_fips = str_pad(as.character(county_id), 5, pad = "0")
) %>%
select(
county_fips,
state = st,
county_name,
airband_fcc_availability = broadband_availability_per_fcc,
airband_usage = broadband_usage
)
# Inspect
names(airband_clean)
## [1] "county_fips" "state"
## [3] "county_name" "airband_fcc_availability"
## [5] "airband_usage"
head(airband_clean)
## # A tibble: 6 × 5
## county_fips state county_name airband_fcc_availability airband_usage
## <chr> <chr> <chr> <chr> <chr>
## 1 01001 AL Autauga County 0.8057 0.391
## 2 01003 AL Baldwin County 0.8362 0.452
## 3 01005 AL Barbour County 0.6891 0.324
## 4 01007 AL Bibb County 0.3368 0.136
## 5 01009 AL Blount County 0.758 0.199
## 6 01011 AL Bullock County 0.9363 0.157
summary(airband_clean$airband_usage)
## Length Class Mode
## 3142 character character
# Load cleaned Microsoft Airband dataset
airband_2020 <- readRDS("processed_data/airband_2020_county_clean.rds")
# Overview of structure
glimpse(airband_2020)
## Rows: 3,142
## Columns: 5
## $ county_fips <chr> "01001", "01003", "01005", "01007", "01009", …
## $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL…
## $ county_name <chr> "Autauga County", "Baldwin County", "Barbour …
## $ airband_fcc_availability <chr> "0.8057", "0.8362", "0.6891", "0.3368", "0.75…
## $ airband_usage <chr> "0.391", "0.452", "0.324", "0.136", "0.199", …
# Use county_fips (not FIPS)
airband_check <- airband_2020 %>%
select(
county_fips,
airband_fcc_availability,
airband_usage
)
# Convert availability and usage to numeric if stored as characters
airband_check <- airband_check %>%
mutate(
airband_fcc_availability = as.numeric(airband_fcc_availability),
airband_usage = as.numeric(airband_usage)
)
# 1) Summary statistics
summary(airband_check)
## county_fips airband_fcc_availability airband_usage
## Length:3142 Min. :0.0002 Min. :0.0010
## Class :character 1st Qu.:0.7700 1st Qu.:0.1960
## Mode :character Median :0.9242 Median :0.3680
## Mean :0.8404 Mean :0.3896
## 3rd Qu.:0.9839 3rd Qu.:0.5660
## Max. :1.0000 Max. :1.0000
## NA's :9 NA's :11
# 2) Range validation
invalid_airband_range <- airband_check %>%
filter(
airband_fcc_availability < 0 | airband_fcc_availability > 1 |
airband_usage < 0 | airband_usage > 1
)
invalid_airband_range # ideally 0 rows
## # A tibble: 0 × 3
## # ℹ 3 variables: county_fips <chr>, airband_fcc_availability <dbl>,
## # airband_usage <dbl>
# 3) Missingness check
colSums(is.na(airband_check))
## county_fips airband_fcc_availability airband_usage
## 0 9 11
# 4) Consistency check: usage > availability
airband_usage_gt_avail <- airband_check %>%
filter(
!is.na(airband_fcc_availability),
!is.na(airband_usage),
airband_usage > airband_fcc_availability
)
# Show examples
airband_usage_gt_avail %>% head()
## # A tibble: 6 × 3
## county_fips airband_fcc_availability airband_usage
## <chr> <dbl> <dbl>
## 1 01063 0.0085 0.013
## 2 01105 0.0012 0.032
## 3 02185 0.0061 0.051
## 4 02188 0.0081 0.039
## 5 02198 0.0282 0.04
## 6 02290 0.013 0.139
nrow(airband_usage_gt_avail)
## [1] 50
Airband Data Validation Summary
The Microsoft Airband dataset was validated after cleaning and aggregation to the county level. Both broadband indicators—modeled FCC availability and estimated household usage—fall entirely within the expected 0–1 range, with missing values limited to counties in Alaska and U.S. territories where Airband suppresses estimates. A subset of counties exhibit usage values that exceed modeled availability. This behavior is well-documented in the Airband methodology and reflects differences between modeled infrastructure coverage and actual adoption (including satellite-based broadband). These records were retained, as they represent genuine modeling differences rather than data errors. (# Need to check this)
write_csv(airband_clean, path_processed("airband_2020_county_clean.csv"))
saveRDS(airband_clean, path_processed("airband_2020_county_clean.rds"))
Ookla provides actual measured internet performance data from Speedtest users. Unlike FCC availability data (which measures where broadband could be offered) or Microsoft Airband usage data, Ookla data captures real-world speeds experienced by consumers. This includes download/upload speeds and latency.
The raw Ookla data is tile-based (millions of geographic tiles). We pre-aggregated this to county level using spatial joins, creating weighted averages by number of speed tests.
# Load pre-aggregated Ookla data (created by scripts/aggregate_ookla_to_county.R)
ookla_2020 <- readRDS(path_processed("ookla_2020_county_clean.rds"))
# Inspect structure
glimpse(ookla_2020)
## Rows: 3,124
## Columns: 9
## $ county_fips <chr> "01001", "01003", "01005", "01007", "01009", "010…
## $ avg_download_mbps <dbl> 139.27879, 105.00465, 99.15987, 37.20531, 56.9981…
## $ avg_upload_mbps <dbl> 66.695142, 55.552737, 11.026225, 22.346111, 23.61…
## $ avg_latency_ms <dbl> 22.40162, 50.51657, 55.41985, 156.95415, 35.75194…
## $ median_download_mbps <dbl> 111.8190, 45.4395, 36.9190, 16.5380, 23.6010, 15.…
## $ median_upload_mbps <dbl> 15.2790, 11.1445, 6.9430, 3.2920, 4.3810, 3.0405,…
## $ total_tests <int> 3095, 15144, 262, 458, 3729, 187, 568, 3224, 900,…
## $ total_devices <int> 833, 5026, 94, 156, 903, 39, 139, 1233, 211, 342,…
## $ n_tiles <int> 325, 1860, 70, 117, 611, 36, 96, 716, 156, 268, 2…
head(ookla_2020)
## # A tibble: 6 × 9
## county_fips avg_download_mbps avg_upload_mbps avg_latency_ms
## <chr> <dbl> <dbl> <dbl>
## 1 01001 139. 66.7 22.4
## 2 01003 105. 55.6 50.5
## 3 01005 99.2 11.0 55.4
## 4 01007 37.2 22.3 157.
## 5 01009 57.0 23.6 35.8
## 6 01011 20.6 7.40 41.0
## # ℹ 5 more variables: median_download_mbps <dbl>, median_upload_mbps <dbl>,
## # total_tests <int>, total_devices <int>, n_tiles <int>
# Summary statistics
cat("=== OOKLA SPEEDTEST SUMMARY (2020) ===\n\n")
## === OOKLA SPEEDTEST SUMMARY (2020) ===
cat("Counties with data:", nrow(ookla_2020), "\n\n")
## Counties with data: 3124
cat("Download Speed (Mbps):\n")
## Download Speed (Mbps):
print(summary(ookla_2020$avg_download_mbps))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.448 42.141 76.151 80.142 113.324 339.268
cat("\nUpload Speed (Mbps):\n")
##
## Upload Speed (Mbps):
print(summary(ookla_2020$avg_upload_mbps))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.217 9.666 16.013 26.432 33.977 200.109
cat("\nLatency (ms):\n")
##
## Latency (ms):
print(summary(ookla_2020$avg_latency_ms))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.70 25.61 40.27 56.56 64.41 745.39
# Distribution plots
library(ggplot2)
p1 <- ggplot(ookla_2020, aes(x = avg_download_mbps)) +
geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
labs(title = "Distribution of County-Level Download Speeds",
x = "Average Download Speed (Mbps)", y = "Count") +
theme_minimal()
p2 <- ggplot(ookla_2020, aes(x = avg_latency_ms)) +
geom_histogram(bins = 50, fill = "coral", alpha = 0.7) +
labs(title = "Distribution of County-Level Latency",
x = "Average Latency (ms)", y = "Count") +
theme_minimal()
print(p1)
print(p2)
# Check for reasonable ranges
invalid_ookla <- ookla_2020 %>%
filter(
avg_download_mbps < 0 | avg_download_mbps > 1000 |
avg_upload_mbps < 0 | avg_upload_mbps > 500 |
avg_latency_ms < 0 | avg_latency_ms > 1000
)
cat("Invalid Ookla records:", nrow(invalid_ookla), "\n")
## Invalid Ookla records: 0
# Check sample sizes
cat("\nSample size distribution (tests per county):\n")
##
## Sample size distribution (tests per county):
print(summary(ookla_2020$total_tests))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 331 1061 6869 3484 938240
# Counties with low sample sizes
low_sample <- ookla_2020 %>% filter(total_tests < 100)
cat("\nCounties with <100 tests:", nrow(low_sample), "\n")
##
## Counties with <100 tests: 281
The USDA Economic Research Service provides Rural-Urban Continuum Codes (RUCC) that classify counties by their degree of urbanization and proximity to metropolitan areas. Codes range from 1 (most urban/metro) to 9 (most rural/remote). This classification is essential for understanding the rural-urban dimension of the digital divide.
# Load pre-processed USDA data (created by scripts/process_usda_rucc.R)
usda_rucc <- readRDS(path_processed("usda_rucc_2023_clean.rds"))
cat("USDA RUCC 2023 Data:\n")
## USDA RUCC 2023 Data:
cat("Counties:", nrow(usda_rucc), "\n\n")
## Counties: 3235
glimpse(usda_rucc)
## Rows: 3,235
## Columns: 9
## $ county_fips <chr> "01001", "01003", "01005", "01007", "01009", "01011",…
## $ state_abbr <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",…
## $ county_name <chr> "Autauga County", "Baldwin County", "Barbour County",…
## $ population_2020 <dbl> 58805, 231767, 25223, 22293, 59134, 10357, 19051, 116…
## $ rucc_2023 <int> 2, 3, 6, 1, 1, 8, 6, 3, 6, 8, 1, 9, 9, 9, 8, 4, 3, 9,…
## $ rucc_description <chr> "Metro - Counties in metro areas of 250,000 to 1 mill…
## $ is_metro <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FA…
## $ is_rural <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, …
## $ rural_urban_cat <chr> "Metro", "Metro", "Nonmetro-Adjacent", "Metro", "Metr…
cat("RUCC Code Distribution:\n")
## RUCC Code Distribution:
print(table(usda_rucc$rucc_2023))
##
## 1 2 3 4 5 6 7 8 9
## 483 398 371 204 81 385 248 468 595
cat("\nRural-Urban Category:\n")
##
## Rural-Urban Category:
print(table(usda_rucc$rural_urban_cat))
##
## Metro Nonmetro-Adjacent Nonmetro-Nonadjacent
## 1252 670 1313
# Visualization
ggplot(usda_rucc, aes(x = factor(rucc_2023), fill = rural_urban_cat)) +
geom_bar() +
labs(title = "Distribution of USDA Rural-Urban Continuum Codes",
x = "RUCC Code (1=Most Urban, 9=Most Rural)",
y = "Number of Counties",
fill = "Category") +
scale_fill_manual(values = c("Metro" = "steelblue",
"Nonmetro-Adjacent" = "orange",
"Nonmetro-Nonadjacent" = "darkred")) +
theme_minimal()
# Show unique descriptions for each RUCC code
rucc_defs <- usda_rucc %>%
select(rucc_2023, rucc_description) %>%
distinct() %>%
arrange(rucc_2023)
print(rucc_defs)
## # A tibble: 10 × 2
## rucc_2023 rucc_description
## <int> <chr>
## 1 1 Metro - Counties in metro areas of 1 million population or more
## 2 2 Metro - Counties in metro areas of 250,000 to 1 million population
## 3 3 Metro - Counties in metro areas of fewer than 250,000 population
## 4 4 Nonmetro - Urban population of 20,000 or more, adjacent to a metro…
## 5 5 Nonmetro - Urban population of 20,000 or more, not adjacent to a m…
## 6 6 Nonmetro - Urban population of 5,000 to 20,000, adjacent to a metr…
## 7 7 Nonmetro - Urban population of 5,000 to 20,000, not adjacent to a …
## 8 8 Nonmetro - Urban population of fewer than 5,000, adjacent to a met…
## 9 9 Nonmetro - Urban population of fewer than 5,000, not adjacent to a…
## 10 NA Not Applicable
fcc_2020 <- readRDS(path_processed(fcc_file)) # Uses configured FCC year
svi_2020 <- readRDS(path_processed("svi_2020_county_clean.rds"))
airband_clean <- readRDS(path_processed("airband_2020_county_clean.rds"))
ookla_2020 <- readRDS(path_processed("ookla_2020_county_clean.rds"))
usda_rucc <- readRDS(path_processed("usda_rucc_2023_clean.rds"))
# Row counts for confirmation
cat("FCC data:", fcc_year_label, "-", nrow(fcc_2020), "counties\n")
## FCC data: May 2025 - 3232 counties
cat("SVI counties:", nrow(svi_2020), "\n")
## SVI counties: 3143
cat("Airband counties:", nrow(airband_clean), "\n")
## Airband counties: 3142
cat("Ookla counties:", nrow(ookla_2020), "\n")
## Ookla counties: 3124
cat("USDA RUCC counties:", nrow(usda_rucc), "\n")
## USDA RUCC counties: 3235
missing_airband <- merged_data %>%
anti_join(airband_clean, by = "county_fips")
head(missing_airband)
## # A tibble: 2 × 18
## county_fips state_name county_name housing_units tier1 tier2 tier3
## <chr> <chr> <chr> <dbl> <int> <int> <int>
## 1 02063 AK Chugach Census Area 4074 4 2 1
## 2 02066 AK Copper River Census Ar… 3898 4 4 2
## # ℹ 11 more variables: coverage_25_3 <dbl>, coverage_100_20 <dbl>,
## # coverage_250_25 <dbl>, coverage_1000_100 <dbl>, state <chr>, county <chr>,
## # svi_overall <dbl>, svi_soc <dbl>, svi_hh <dbl>, svi_min <dbl>,
## # svi_hous <dbl>
nrow(missing_airband)
## [1] 2
# Merge Airband
merged_all <- merged_data %>%
inner_join(airband_clean, by = "county_fips")
# Merge Ookla (left join to keep all counties even if no Ookla data)
merged_all <- merged_all %>%
left_join(ookla_2020, by = "county_fips")
# Merge USDA RUCC (left join, select only key variables to avoid name conflicts)
usda_for_merge <- usda_rucc %>%
select(county_fips, rucc_2023, is_metro, is_rural, rural_urban_cat)
merged_all <- merged_all %>%
left_join(usda_for_merge, by = "county_fips")
# Inspect final dataset
cat("Final merged dataset rows:", nrow(merged_all), "\n")
## Final merged dataset rows: 3141
cat("Counties with Ookla data:", sum(!is.na(merged_all$avg_download_mbps)), "\n")
## Counties with Ookla data: 3122
cat("Counties with USDA RUCC:", sum(!is.na(merged_all$rucc_2023)), "\n\n")
## Counties with USDA RUCC: 3133
# Rural-urban breakdown
cat("Metro vs Nonmetro in merged data:\n")
## Metro vs Nonmetro in merged data:
print(table(merged_all$rural_urban_cat, useNA = "ifany"))
##
## Metro Nonmetro-Adjacent Nonmetro-Nonadjacent
## 1179 654 1300
## <NA>
## 8
head(merged_all)
## # A tibble: 6 × 34
## county_fips state_name county_name.x housing_units tier1 tier2 tier3
## <chr> <chr> <chr> <dbl> <int> <int> <int>
## 1 01001 AL Autauga County 27544 4 4 4
## 2 01003 AL Baldwin County 137072 4 2 2
## 3 01005 AL Barbour County 15445 4 2 2
## 4 01007 AL Bibb County 11168 4 1 1
## 5 01009 AL Blount County 31265 4 1 1
## 6 01011 AL Bullock County 6136 4 3 1
## # ℹ 27 more variables: coverage_25_3 <dbl>, coverage_100_20 <dbl>,
## # coverage_250_25 <dbl>, coverage_1000_100 <dbl>, state.x <chr>,
## # county <chr>, svi_overall <dbl>, svi_soc <dbl>, svi_hh <dbl>,
## # svi_min <dbl>, svi_hous <dbl>, state.y <chr>, county_name.y <chr>,
## # airband_fcc_availability <chr>, airband_usage <chr>,
## # avg_download_mbps <dbl>, avg_upload_mbps <dbl>, avg_latency_ms <dbl>,
## # median_download_mbps <dbl>, median_upload_mbps <dbl>, total_tests <int>, …
# Save with updated name reflecting all integrations
write_csv(merged_all, path_processed("merged_master_2020.csv"))
saveRDS(merged_all, path_processed("merged_master_2020.rds"))
# Also save to original names for backwards compatibility
write_csv(merged_all, path_processed("merged_fcc_svi_airband_ookla_2020.csv"))
saveRDS(merged_all, path_processed("merged_fcc_svi_airband_ookla_2020.rds"))
write_csv(merged_all, path_processed("merged_fcc_svi_airband_2020.csv"))
saveRDS(merged_all, path_processed("merged_fcc_svi_airband_2020.rds"))
cat("Saved merged dataset with", ncol(merged_all), "columns\n")
## Saved merged dataset with 34 columns
This section provides comprehensive visualizations exploring the digital divide across rural-urban classifications, income levels, and geographic regions.
# Prepare data with RUCC categories
viz_data <- merged_all %>%
filter(!is.na(rural_urban_cat), !is.na(avg_download_mbps))
# Box plot: Download speeds by rural-urban category
p1 <- ggplot(viz_data, aes(x = rural_urban_cat, y = avg_download_mbps, fill = rural_urban_cat)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.3) +
scale_fill_manual(values = c("Metro" = "#2E86AB",
"Nonmetro-Adjacent" = "#F6AE2D",
"Nonmetro-Nonadjacent" = "#E94F37")) +
labs(title = "Download Speeds by Rural-Urban Classification",
subtitle = "Ookla Speedtest Data (2020)",
x = "", y = "Average Download Speed (Mbps)") +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", size = 14))
# Box plot: Latency by rural-urban category
p2 <- ggplot(viz_data, aes(x = rural_urban_cat, y = avg_latency_ms, fill = rural_urban_cat)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.3) +
scale_fill_manual(values = c("Metro" = "#2E86AB",
"Nonmetro-Adjacent" = "#F6AE2D",
"Nonmetro-Nonadjacent" = "#E94F37")) +
labs(title = "Network Latency by Rural-Urban Classification",
subtitle = "Higher latency = slower response times",
x = "", y = "Average Latency (ms)") +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", size = 14))
gridExtra::grid.arrange(p1, p2, ncol = 2)
viz_data2 <- merged_all %>%
filter(!is.na(rucc_2023), !is.na(avg_download_mbps)) %>%
mutate(rucc_label = paste0("RUCC ", rucc_2023))
ggplot(viz_data2, aes(x = avg_download_mbps, fill = factor(rucc_2023))) +
geom_density(alpha = 0.5) +
scale_fill_viridis_d(name = "RUCC Code", option = "plasma") +
labs(title = "Download Speed Distribution by RUCC Code",
subtitle = "RUCC 1-3: Metro | RUCC 4-6: Nonmetro Adjacent | RUCC 7-9: Remote Rural",
x = "Average Download Speed (Mbps)", y = "Density") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 14)) +
xlim(0, 200)
# Compare FCC reported coverage with actual measured speeds
viz_data3 <- merged_all %>%
filter(!is.na(coverage_25_3), !is.na(avg_download_mbps))
if (nrow(viz_data3) > 0) {
ggplot(viz_data3, aes(x = coverage_25_3 * 100, y = avg_download_mbps, color = rural_urban_cat)) +
geom_point(alpha = 0.5, size = 2) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
scale_color_manual(values = c("Metro" = "#2E86AB",
"Nonmetro-Adjacent" = "#F6AE2D",
"Nonmetro-Nonadjacent" = "#E94F37")) +
labs(title = "FCC Reported Coverage vs Actual Measured Speeds",
subtitle = "Does high coverage translate to fast speeds?",
x = "FCC 25/3 Mbps Coverage (%)",
y = "Actual Avg Download Speed (Mbps)",
color = "Area Type") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 14),
legend.position = "bottom")
} else {
cat("Note: Coverage data not available for this comparison\n")
}
# Select numeric variables for correlation
cor_vars <- merged_all %>%
select(
svi_overall, svi_soc, svi_hh, svi_min, svi_hous,
tier2, tier3,
avg_download_mbps, avg_upload_mbps, avg_latency_ms,
airband_usage, rucc_2023
) %>%
select(where(is.numeric)) %>%
na.omit() # Remove rows with NA for clean correlation
# Compute correlation matrix
cor_matrix <- cor(cor_vars, use = "complete.obs")
# Plot correlation heatmap (removed hclust ordering to avoid NA issues)
corrplot::corrplot(cor_matrix,
method = "color",
type = "upper",
order = "original",
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
number.cex = 0.7,
col = corrplot::COL2('RdBu', 10),
title = "Correlation Matrix: Digital Divide Indicators",
mar = c(0, 0, 2, 0))
# Create speed tier categories from actual Ookla speeds
viz_data5 <- merged_all %>%
filter(!is.na(avg_download_mbps), !is.na(rural_urban_cat)) %>%
mutate(
speed_tier = case_when(
avg_download_mbps < 25 ~ "Below Basic (<25 Mbps)",
avg_download_mbps < 100 ~ "Basic (25-100 Mbps)",
avg_download_mbps < 250 ~ "Fast (100-250 Mbps)",
TRUE ~ "Very Fast (250+ Mbps)"
),
speed_tier = factor(speed_tier, levels = c("Below Basic (<25 Mbps)",
"Basic (25-100 Mbps)",
"Fast (100-250 Mbps)",
"Very Fast (250+ Mbps)"))
)
# Stacked bar chart
ggplot(viz_data5, aes(x = rural_urban_cat, fill = speed_tier)) +
geom_bar(position = "fill") +
scale_fill_manual(values = c("Below Basic (<25 Mbps)" = "#E94F37",
"Basic (25-100 Mbps)" = "#F6AE2D",
"Fast (100-250 Mbps)" = "#7FB069",
"Very Fast (250+ Mbps)" = "#2E86AB")) +
scale_y_continuous(labels = scales::percent) +
labs(title = "Distribution of Actual Internet Speeds by Area Type",
subtitle = "Based on Ookla Speedtest measurements",
x = "", y = "Proportion of Counties",
fill = "Speed Tier") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 14),
legend.position = "bottom") +
coord_flip()
summary_stats <- merged_all %>%
filter(!is.na(rural_urban_cat)) %>%
group_by(rural_urban_cat) %>%
summarise(
n_counties = n(),
avg_download = round(mean(avg_download_mbps, na.rm = TRUE), 1),
median_download = round(median(avg_download_mbps, na.rm = TRUE), 1),
avg_latency = round(mean(avg_latency_ms, na.rm = TRUE), 1),
avg_svi = round(mean(svi_overall, na.rm = TRUE), 3),
pct_below_25mbps = round(100 * mean(avg_download_mbps < 25, na.rm = TRUE), 1),
.groups = "drop"
)
cat("\n=== DIGITAL DIVIDE SUMMARY BY AREA TYPE ===\n\n")
##
## === DIGITAL DIVIDE SUMMARY BY AREA TYPE ===
print(summary_stats)
## # A tibble: 3 × 7
## rural_urban_cat n_counties avg_download median_download avg_latency avg_svi
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Metro 1179 110. 114. 43.1 0.476
## 2 Nonmetro-Adjacent 654 82.1 80.6 52.4 0.58
## 3 Nonmetro-Nonadjac… 1300 51.8 43.5 71.3 0.482
## # ℹ 1 more variable: pct_below_25mbps <dbl>
# Visual table using knitr if available
if (requireNamespace("knitr", quietly = TRUE)) {
knitr::kable(summary_stats,
col.names = c("Area Type", "Counties", "Avg DL (Mbps)",
"Median DL", "Avg Latency (ms)", "Avg SVI", "% Below 25Mbps"),
caption = "Digital Divide Metrics by Rural-Urban Classification")
}
| Area Type | Counties | Avg DL (Mbps) | Median DL | Avg Latency (ms) | Avg SVI | % Below 25Mbps |
|---|---|---|---|---|---|---|
| Metro | 1179 | 109.5 | 114.2 | 43.1 | 0.476 | 3.9 |
| Nonmetro-Adjacent | 654 | 82.1 | 80.6 | 52.4 | 0.580 | 2.8 |
| Nonmetro-Nonadjacent | 1300 | 51.8 | 43.5 | 71.3 | 0.482 | 25.1 |
# Note: This chunk creates a US map colored by RUCC codes
# Set eval=TRUE to generate (requires counties_sf shapefile)
# Load shapefile if available
if (file.exists(path_processed("counties_sf.rds"))) {
counties_sf <- readRDS(path_processed("counties_sf.rds"))
# Join with RUCC data
map_data <- counties_sf %>%
left_join(merged_all %>% select(county_fips, rucc_2023, rural_urban_cat),
by = c("GEOID" = "county_fips"))
# Filter to continental US
map_data <- map_data %>%
filter(!STATEFP %in% c("02", "15", "60", "66", "69", "72", "78"))
ggplot(map_data) +
geom_sf(aes(fill = rural_urban_cat), color = NA) +
scale_fill_manual(values = c("Metro" = "#2E86AB",
"Nonmetro-Adjacent" = "#F6AE2D",
"Nonmetro-Nonadjacent" = "#E94F37"),
na.value = "gray90") +
labs(title = "Rural-Urban Classification of US Counties",
subtitle = "USDA Rural-Urban Continuum Codes (2023)",
fill = "Category") +
theme_void() +
theme(plot.title = element_text(face = "bold", size = 16),
legend.position = "bottom")
}
To conduct spatial analysis and merge all county-level datasets consistently, we first processed the 2020 U.S. Census TIGER/Line county shapefile. This shapefile provides the geographic boundary polygons and the official county FIPS codes (GEOID) used as the primary spatial join key in later stages of the project.
sf::sf_use_s2(FALSE)
We now load the 2020 U.S. Census TIGER/Line county shapefile into R using the sf package. This file contains county boundary polygons and the GEOID field that will be used later for spatial joining with ACS, SVI, FCC, and Airband datasets.
shp_path <- "raw_data/geographic/tl_2020_us_county.shp"
counties_raw <- sf::st_read(shp_path)
## Reading layer `tl_2020_us_county' from data source
## `/Users/yogasundaramramaswamy/Downloads/digital_divide_project/raw_data/geographic/tl_2020_us_county.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 3234 features and 17 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
## Geodetic CRS: NAD83
#7.3 Select Relevant Columns
The county shapefile contains many fields, but for this project we only need the attributes required for merging and mapping. Specifically, we keep:
GEOID – 5-digit county FIPS (primary join key)
NAME – county name
STATEFP – state FIPS code
COUNTYFP – county number within the state
geometry – polygon boundary information
counties_sf <- counties_raw %>%
dplyr::select(GEOID, NAME, STATEFP, COUNTYFP, geometry)
counties_sf
## Simple feature collection with 3234 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
## Geodetic CRS: NAD83
## First 10 features:
## GEOID NAME STATEFP COUNTYFP geometry
## 1 31039 Cuming 31 039 MULTIPOLYGON (((-97.01952 4...
## 2 53069 Wahkiakum 53 069 MULTIPOLYGON (((-123.4364 4...
## 3 35011 De Baca 35 011 MULTIPOLYGON (((-104.5674 3...
## 4 31109 Lancaster 31 109 MULTIPOLYGON (((-96.91075 4...
## 5 31129 Nuckolls 31 129 MULTIPOLYGON (((-98.27367 4...
## 6 72085 Las Piedras 72 085 MULTIPOLYGON (((-65.91048 1...
## 7 46099 Minnehaha 46 099 MULTIPOLYGON (((-96.89022 4...
## 8 48327 Menard 48 327 MULTIPOLYGON (((-99.82187 3...
## 9 06091 Sierra 06 091 MULTIPOLYGON (((-120.6556 3...
## 10 21053 Clinton 21 053 MULTIPOLYGON (((-85.2391 36...
Different datasets (SVI, ACS, FCC, Airband) use FIPS codes to identify counties. To ensure all datasets merge correctly, we convert the GEOID column from the shapefile into a 5-digit character string with leading zeros.
Example:
1001 → 01001
3501 → 03501
This prevents merge errors and ensures consistency across all data sources.
counties_sf <- counties_sf %>%
mutate(
GEOID = as.character(GEOID), # convert to character
GEOID = stringr::str_pad(GEOID, # pad with leading zeros
width = 5,
side = "left",
pad = "0")
)
# Display first 20 FIPS codes to verify correctness
head(counties_sf$GEOID, 20)
## [1] "31039" "53069" "35011" "31109" "31129" "72085" "46099" "48327" "06091"
## [10] "21053" "39063" "48189" "01027" "48011" "39003" "13189" "55111" "05137"
## [19] "41063" "42007"
The TIGER/Line county shapefile is in the NAD83 coordinate reference system. For most visualization tools (e.g., ggplot2, tmap, leaflet), it is convenient to use WGS84 (EPSG:4326). Here, we reproject the spatial object once so it is ready for mapping and spatial analysis.
counties_sf <- st_transform(counties_sf, crs = 4326)
# Quick check of the CRS and object
counties_sf
## Simple feature collection with 3234 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
## Geodetic CRS: WGS 84
## First 10 features:
## GEOID NAME STATEFP COUNTYFP geometry
## 1 31039 Cuming 31 039 MULTIPOLYGON (((-97.01952 4...
## 2 53069 Wahkiakum 53 069 MULTIPOLYGON (((-123.4364 4...
## 3 35011 De Baca 35 011 MULTIPOLYGON (((-104.5674 3...
## 4 31109 Lancaster 31 109 MULTIPOLYGON (((-96.91075 4...
## 5 31129 Nuckolls 31 129 MULTIPOLYGON (((-98.27367 4...
## 6 72085 Las Piedras 72 085 MULTIPOLYGON (((-65.91048 1...
## 7 46099 Minnehaha 46 099 MULTIPOLYGON (((-96.89022 4...
## 8 48327 Menard 48 327 MULTIPOLYGON (((-99.82187 3...
## 9 06091 Sierra 06 091 MULTIPOLYGON (((-120.6556 3...
## 10 21053 Clinton 21 053 MULTIPOLYGON (((-85.2391 36...
st_crs(counties_sf)
## Coordinate Reference System:
## User input: EPSG:4326
## wkt:
## GEOGCRS["WGS 84",
## ENSEMBLE["World Geodetic System 1984 ensemble",
## MEMBER["World Geodetic System 1984 (Transit)"],
## MEMBER["World Geodetic System 1984 (G730)"],
## MEMBER["World Geodetic System 1984 (G873)"],
## MEMBER["World Geodetic System 1984 (G1150)"],
## MEMBER["World Geodetic System 1984 (G1674)"],
## MEMBER["World Geodetic System 1984 (G1762)"],
## MEMBER["World Geodetic System 1984 (G2139)"],
## MEMBER["World Geodetic System 1984 (G2296)"],
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]],
## ENSEMBLEACCURACY[2.0]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["geodetic latitude (Lat)",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["geodetic longitude (Lon)",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## USAGE[
## SCOPE["Horizontal component of 3D system."],
## AREA["World."],
## BBOX[-90,-180,90,180]],
## ID["EPSG",4326]]
# Save Cleaned County Spatial Data as RDS
# Save directly into your existing processed_data folder
saveRDS(counties_sf, "processed_data/counties_sf.rds")
# Confirm it saved
list.files("processed_data")
## [1] "acs_all_2020_county_clean.csv"
## [2] "acs_all_2020_county_clean.rds"
## [3] "acs_b15003_2020_county_clean.csv"
## [4] "acs_b15003_2020_county_clean.rds"
## [5] "acs_b19013_2020_county_clean.csv"
## [6] "acs_b19013_2020_county_clean.rds"
## [7] "acs_b28002_2020_county_clean.csv"
## [8] "acs_b28002_2020_county_clean.rds"
## [9] "airband_2020_county_clean.csv"
## [10] "airband_2020_county_clean.rds"
## [11] "airband_inconsistency_counties.csv"
## [12] "analysis_2020_county_final_sf.rds"
## [13] "analysis_2020_county_final.csv"
## [14] "analysis_2020_county_final.rds"
## [15] "analysis_2020_county_sf.rds"
## [16] "analysis_2020_county.csv"
## [17] "analysis_2020_county.rds"
## [18] "counties_sf.rds"
## [19] "fcc_2020_dec_clean.csv"
## [20] "fcc_2020_dec_clean.rds"
## [21] "fcc_2025_may_clean.csv"
## [22] "fcc_2025_may_clean.rds"
## [23] "master_2020_county_sf.rds"
## [24] "master_2020_county.csv"
## [25] "master_2020_county.rds"
## [26] "merged_fcc_svi_2020.csv"
## [27] "merged_fcc_svi_2020.rds"
## [28] "merged_fcc_svi_airband_2020.csv"
## [29] "merged_fcc_svi_airband_2020.rds"
## [30] "merged_fcc_svi_airband_ookla_2020.csv"
## [31] "merged_fcc_svi_airband_ookla_2020.rds"
## [32] "merged_master_2020.csv"
## [33] "merged_master_2020.rds"
## [34] "ookla_2020_county_clean.csv"
## [35] "ookla_2020_county_clean.rds"
## [36] "svi_2020_county_clean.csv"
## [37] "svi_2020_county_clean.rds"
## [38] "usda_rucc_2023_clean.csv"
## [39] "usda_rucc_2023_clean.rds"
The American Community Survey (ACS) provides essential socioeconomic indicators that are critical for understanding digital inequality across U.S. counties. Unlike datasets that only measure broadband availability or usage, ACS tables contain detailed information about household income, educational attainment, and internet/computer access—all of which are deeply connected to digital divide outcomes.
The ACS table B28002 provides detailed information about internet and computer access at the county level. This dataset is essential for understanding digital adoption, device availability, and technology access across communities. However, the raw ACS download includes extra columns—such as margins of error (M columns), metadata, and a GEO_ID field containing long identifiers.
# Load ACS B28002
b28002_path <- "raw_data/census/ACSDT5Y2020.B28002-Data.csv"
acs_b28002_raw <- read_csv(b28002_path)
# Remove label row
acs_b28002_raw <- acs_b28002_raw %>%
filter(GEO_ID != "Geography")
# Keep only important columns
acs_b28002_clean <- acs_b28002_raw %>%
select(
GEO_ID, NAME,
B28002_001E, # total households
B28002_002E, # broadband households
B28002_012E, # households with no internet
B28002_013E # households with no computer
# OPTIONAL: add B28002_004E, _005E, _006E if needed
) %>%
mutate(
FIPS = str_sub(GEO_ID, -5, -1),
FIPS = str_pad(FIPS, 5, pad = "0")
) %>%
select(FIPS, GEO_ID, NAME, everything())
# Preview
head(acs_b28002_clean)
## # A tibble: 6 × 7
## FIPS GEO_ID NAME B28002_001E B28002_002E B28002_012E B28002_013E
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 01001 0500000US01001 Autauga … 21559 17850 402 3307
## 2 01003 0500000US01003 Baldwin … 84047 71880 2149 10018
## 3 01005 0500000US01005 Barbour … 9322 6059 563 2700
## 4 01007 0500000US01007 Bibb Cou… 7259 5529 210 1520
## 5 01009 0500000US01009 Blount C… 21205 16971 314 3920
## 6 01011 0500000US01011 Bullock … 3429 2153 236 1040
# Save to your existing processed_data folder
write_csv(acs_b28002_clean, "processed_data/acs_b28002_2020_county_clean.csv")
saveRDS(acs_b28002_clean, "processed_data/acs_b28002_2020_county_clean.rds")
# Confirm that the files saved
list.files("processed_data")
## [1] "acs_all_2020_county_clean.csv"
## [2] "acs_all_2020_county_clean.rds"
## [3] "acs_b15003_2020_county_clean.csv"
## [4] "acs_b15003_2020_county_clean.rds"
## [5] "acs_b19013_2020_county_clean.csv"
## [6] "acs_b19013_2020_county_clean.rds"
## [7] "acs_b28002_2020_county_clean.csv"
## [8] "acs_b28002_2020_county_clean.rds"
## [9] "airband_2020_county_clean.csv"
## [10] "airband_2020_county_clean.rds"
## [11] "airband_inconsistency_counties.csv"
## [12] "analysis_2020_county_final_sf.rds"
## [13] "analysis_2020_county_final.csv"
## [14] "analysis_2020_county_final.rds"
## [15] "analysis_2020_county_sf.rds"
## [16] "analysis_2020_county.csv"
## [17] "analysis_2020_county.rds"
## [18] "counties_sf.rds"
## [19] "fcc_2020_dec_clean.csv"
## [20] "fcc_2020_dec_clean.rds"
## [21] "fcc_2025_may_clean.csv"
## [22] "fcc_2025_may_clean.rds"
## [23] "master_2020_county_sf.rds"
## [24] "master_2020_county.csv"
## [25] "master_2020_county.rds"
## [26] "merged_fcc_svi_2020.csv"
## [27] "merged_fcc_svi_2020.rds"
## [28] "merged_fcc_svi_airband_2020.csv"
## [29] "merged_fcc_svi_airband_2020.rds"
## [30] "merged_fcc_svi_airband_ookla_2020.csv"
## [31] "merged_fcc_svi_airband_ookla_2020.rds"
## [32] "merged_master_2020.csv"
## [33] "merged_master_2020.rds"
## [34] "ookla_2020_county_clean.csv"
## [35] "ookla_2020_county_clean.rds"
## [36] "svi_2020_county_clean.csv"
## [37] "svi_2020_county_clean.rds"
## [38] "usda_rucc_2023_clean.csv"
## [39] "usda_rucc_2023_clean.rds"
ACS table B19013 contains the median household income for each U.S. county. Income is one of the strongest predictors of digital access, affordability, and adoption — making it a critical variable in your digital divide analysis.
Because the raw ACS file includes:
metadata rows
margin of error (_M) columns
long GEO_ID fields
We must clean and standardize the dataset before merging it with other county-level data.
# 8.2 Load raw ACS B19013 data
b19013_path <- "raw_data/census/ACSDT5Y2020.B19013-Data.csv"
acs_b19013_raw <- read_csv(b19013_path)
# Remove the label row
acs_b19013_raw <- acs_b19013_raw %>%
filter(GEO_ID != "Geography")
# Keep only GEO_ID, NAME, and the income estimate column
acs_b19013_clean <- acs_b19013_raw %>%
select(
GEO_ID,
NAME,
B19013_001E # median household income
) %>%
mutate(
FIPS = str_sub(GEO_ID, -5, -1),
FIPS = str_pad(FIPS, 5, pad = "0")
) %>%
select(FIPS, GEO_ID, NAME, everything())
# Preview cleaned file
head(acs_b19013_clean)
## # A tibble: 6 × 4
## FIPS GEO_ID NAME B19013_001E
## <chr> <chr> <chr> <chr>
## 1 01001 0500000US01001 Autauga County, Alabama 57982
## 2 01003 0500000US01003 Baldwin County, Alabama 61756
## 3 01005 0500000US01005 Barbour County, Alabama 34990
## 4 01007 0500000US01007 Bibb County, Alabama 51721
## 5 01009 0500000US01009 Blount County, Alabama 48922
## 6 01011 0500000US01011 Bullock County, Alabama 33866
write_csv(acs_b19013_clean, "processed_data/acs_b19013_2020_county_clean.csv")
saveRDS(acs_b19013_clean, "processed_data/acs_b19013_2020_county_clean.rds")
# Confirm saved files
list.files("processed_data")
## [1] "acs_all_2020_county_clean.csv"
## [2] "acs_all_2020_county_clean.rds"
## [3] "acs_b15003_2020_county_clean.csv"
## [4] "acs_b15003_2020_county_clean.rds"
## [5] "acs_b19013_2020_county_clean.csv"
## [6] "acs_b19013_2020_county_clean.rds"
## [7] "acs_b28002_2020_county_clean.csv"
## [8] "acs_b28002_2020_county_clean.rds"
## [9] "airband_2020_county_clean.csv"
## [10] "airband_2020_county_clean.rds"
## [11] "airband_inconsistency_counties.csv"
## [12] "analysis_2020_county_final_sf.rds"
## [13] "analysis_2020_county_final.csv"
## [14] "analysis_2020_county_final.rds"
## [15] "analysis_2020_county_sf.rds"
## [16] "analysis_2020_county.csv"
## [17] "analysis_2020_county.rds"
## [18] "counties_sf.rds"
## [19] "fcc_2020_dec_clean.csv"
## [20] "fcc_2020_dec_clean.rds"
## [21] "fcc_2025_may_clean.csv"
## [22] "fcc_2025_may_clean.rds"
## [23] "master_2020_county_sf.rds"
## [24] "master_2020_county.csv"
## [25] "master_2020_county.rds"
## [26] "merged_fcc_svi_2020.csv"
## [27] "merged_fcc_svi_2020.rds"
## [28] "merged_fcc_svi_airband_2020.csv"
## [29] "merged_fcc_svi_airband_2020.rds"
## [30] "merged_fcc_svi_airband_ookla_2020.csv"
## [31] "merged_fcc_svi_airband_ookla_2020.rds"
## [32] "merged_master_2020.csv"
## [33] "merged_master_2020.rds"
## [34] "ookla_2020_county_clean.csv"
## [35] "ookla_2020_county_clean.rds"
## [36] "svi_2020_county_clean.csv"
## [37] "svi_2020_county_clean.rds"
## [38] "usda_rucc_2023_clean.csv"
## [39] "usda_rucc_2023_clean.rds"
ACS table B15003 provides counts of the population by educational attainment level. For digital divide research, education is one of the most important predictors of digital literacy, employment opportunities, and broadband adoption.
library(dplyr)
library(readr)
library(stringr)
# 8.3 Load raw ACS B15003 education data
b15003_path <- "raw_data/census/ACSDT5Y2020.B15003-Data.csv"
acs_b15003_raw <- read_csv(b15003_path)
# Remove descriptive header row
acs_b15003_raw <- acs_b15003_raw %>%
filter(GEO_ID != "Geography")
# Keep key educational variables
acs_b15003_clean <- acs_b15003_raw %>%
select(
GEO_ID,
NAME,
B15003_001E, # total population age 25+
B15003_017E, # high school graduate
B15003_022E, # bachelor's degree
B15003_023E, # master's degree
B15003_025E # doctorate degree
) %>%
mutate(
FIPS = str_sub(GEO_ID, -5, -1),
FIPS = str_pad(FIPS, 5, pad = "0")
) %>%
select(FIPS, GEO_ID, NAME, everything())
# Preview
head(acs_b15003_clean)
## # A tibble: 6 × 8
## FIPS GEO_ID NAME B15003_001E B15003_017E B15003_022E B15003_023E B15003_025E
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 01001 05000… Auta… 37860 9663 6320 3399 498
## 2 01003 05000… Bald… 155563 34521 31444 13331 2016
## 3 01005 05000… Barb… 17797 5171 1296 540 113
## 4 01007 05000… Bibb… 15987 5879 1183 481 41
## 5 01009 05000… Blou… 39814 10582 3540 1382 132
## 6 01011 05000… Bull… 6784 2498 413 152 28
write_csv(acs_b15003_clean, "processed_data/acs_b15003_2020_county_clean.csv")
saveRDS(acs_b15003_clean, "processed_data/acs_b15003_2020_county_clean.rds")
# Confirm saved files
list.files("processed_data")
## [1] "acs_all_2020_county_clean.csv"
## [2] "acs_all_2020_county_clean.rds"
## [3] "acs_b15003_2020_county_clean.csv"
## [4] "acs_b15003_2020_county_clean.rds"
## [5] "acs_b19013_2020_county_clean.csv"
## [6] "acs_b19013_2020_county_clean.rds"
## [7] "acs_b28002_2020_county_clean.csv"
## [8] "acs_b28002_2020_county_clean.rds"
## [9] "airband_2020_county_clean.csv"
## [10] "airband_2020_county_clean.rds"
## [11] "airband_inconsistency_counties.csv"
## [12] "analysis_2020_county_final_sf.rds"
## [13] "analysis_2020_county_final.csv"
## [14] "analysis_2020_county_final.rds"
## [15] "analysis_2020_county_sf.rds"
## [16] "analysis_2020_county.csv"
## [17] "analysis_2020_county.rds"
## [18] "counties_sf.rds"
## [19] "fcc_2020_dec_clean.csv"
## [20] "fcc_2020_dec_clean.rds"
## [21] "fcc_2025_may_clean.csv"
## [22] "fcc_2025_may_clean.rds"
## [23] "master_2020_county_sf.rds"
## [24] "master_2020_county.csv"
## [25] "master_2020_county.rds"
## [26] "merged_fcc_svi_2020.csv"
## [27] "merged_fcc_svi_2020.rds"
## [28] "merged_fcc_svi_airband_2020.csv"
## [29] "merged_fcc_svi_airband_2020.rds"
## [30] "merged_fcc_svi_airband_ookla_2020.csv"
## [31] "merged_fcc_svi_airband_ookla_2020.rds"
## [32] "merged_master_2020.csv"
## [33] "merged_master_2020.rds"
## [34] "ookla_2020_county_clean.csv"
## [35] "ookla_2020_county_clean.rds"
## [36] "svi_2020_county_clean.csv"
## [37] "svi_2020_county_clean.rds"
## [38] "usda_rucc_2023_clean.csv"
## [39] "usda_rucc_2023_clean.rds"
library(dplyr)
library(readr)
library(stringr)
# 8.4.1 Load cleaned ACS datasets
acs_b28002 <- readRDS("processed_data/acs_b28002_2020_county_clean.rds")
acs_b19013 <- readRDS("processed_data/acs_b19013_2020_county_clean.rds")
acs_b15003 <- readRDS("processed_data/acs_b15003_2020_county_clean.rds")
# 8.4.2 Prepare B28002 (internet/computer)
acs_b28002_small <- acs_b28002 %>%
dplyr::rename(
internet_total_hh = B28002_001E,
internet_broadband = B28002_002E,
internet_no_access = B28002_012E,
computer_no_device = B28002_013E
)
# 8.4.3 Prepare B19013 (income)
acs_b19013_small <- acs_b19013 %>%
dplyr::select(
FIPS,
income_median = B19013_001E
)
# 8.4.4 Prepare B15003 (education)
acs_b15003_small <- acs_b15003 %>%
dplyr::select(
FIPS,
edu_total_25plus = B15003_001E,
edu_hs = B15003_017E,
edu_bach = B15003_022E,
edu_mast = B15003_023E,
edu_doc = B15003_025E
)
# 8.4.5 Merge all three ACS datasets
acs_2020_all <- acs_b28002_small %>%
dplyr::left_join(acs_b19013_small, by = "FIPS") %>%
dplyr::left_join(acs_b15003_small, by = "FIPS")
glimpse(acs_2020_all)
## Rows: 3,221
## Columns: 13
## $ FIPS <chr> "01001", "01003", "01005", "01007", "01009", "01011…
## $ GEO_ID <chr> "0500000US01001", "0500000US01003", "0500000US01005…
## $ NAME <chr> "Autauga County, Alabama", "Baldwin County, Alabama…
## $ internet_total_hh <chr> "21559", "84047", "9322", "7259", "21205", "3429", …
## $ internet_broadband <chr> "17850", "71880", "6059", "5529", "16971", "2153", …
## $ internet_no_access <chr> "402", "2149", "563", "210", "314", "236", "150", "…
## $ computer_no_device <chr> "3307", "10018", "2700", "1520", "3920", "1040", "1…
## $ income_median <chr> "57982", "61756", "34990", "51721", "48922", "33866…
## $ edu_total_25plus <chr> "37860", "155563", "17797", "15987", "39814", "6784…
## $ edu_hs <chr> "9663", "34521", "5171", "5879", "10582", "2498", "…
## $ edu_bach <chr> "6320", "31444", "1296", "1183", "3540", "413", "13…
## $ edu_mast <chr> "3399", "13331", "540", "481", "1382", "152", "651"…
## $ edu_doc <chr> "498", "2016", "113", "41", "132", "28", "19", "883…
head(acs_2020_all)
## # A tibble: 6 × 13
## FIPS GEO_ID NAME internet_total_hh internet_broadband internet_no_access
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 01001 0500000US… Auta… 21559 17850 402
## 2 01003 0500000US… Bald… 84047 71880 2149
## 3 01005 0500000US… Barb… 9322 6059 563
## 4 01007 0500000US… Bibb… 7259 5529 210
## 5 01009 0500000US… Blou… 21205 16971 314
## 6 01011 0500000US… Bull… 3429 2153 236
## # ℹ 7 more variables: computer_no_device <chr>, income_median <chr>,
## # edu_total_25plus <chr>, edu_hs <chr>, edu_bach <chr>, edu_mast <chr>,
## # edu_doc <chr>
# 8.4.x Ensure ACS numeric columns are numeric, not character
acs_2020_all <- acs_2020_all %>%
mutate(
across(
c(
internet_total_hh,
internet_broadband,
internet_no_access,
computer_no_device,
income_median,
edu_total_25plus,
edu_hs,
edu_bach,
edu_mast,
edu_doc
),
~ as.numeric(.)
)
)
# Check types again
glimpse(acs_2020_all)
## Rows: 3,221
## Columns: 13
## $ FIPS <chr> "01001", "01003", "01005", "01007", "01009", "01011…
## $ GEO_ID <chr> "0500000US01001", "0500000US01003", "0500000US01005…
## $ NAME <chr> "Autauga County, Alabama", "Baldwin County, Alabama…
## $ internet_total_hh <dbl> 21559, 84047, 9322, 7259, 21205, 3429, 6649, 44572,…
## $ internet_broadband <dbl> 17850, 71880, 6059, 5529, 16971, 2153, 4941, 35642,…
## $ internet_no_access <dbl> 402, 2149, 563, 210, 314, 236, 150, 931, 284, 250, …
## $ computer_no_device <dbl> 3307, 10018, 2700, 1520, 3920, 1040, 1558, 7999, 31…
## $ income_median <dbl> 57982, 61756, 34990, 51721, 48922, 33866, 44850, 50…
## $ edu_total_25plus <dbl> 37860, 155563, 17797, 15987, 39814, 6784, 13820, 78…
## $ edu_hs <dbl> 9663, 34521, 5171, 5879, 10582, 2498, 5583, 21801, …
## $ edu_bach <dbl> 6320, 31444, 1296, 1183, 3540, 413, 1396, 8296, 249…
## $ edu_mast <dbl> 3399, 13331, 540, 481, 1382, 152, 651, 4618, 740, 7…
## $ edu_doc <dbl> 498, 2016, 113, 41, 132, 28, 19, 883, 29, 132, 106,…
# Save the final cleaned + numeric + merged ACS dataset
write_csv(acs_2020_all, "processed_data/acs_all_2020_county_clean.csv")
saveRDS(acs_2020_all, "processed_data/acs_all_2020_county_clean.rds")
# Confirm saved
list.files("processed_data")
## [1] "acs_all_2020_county_clean.csv"
## [2] "acs_all_2020_county_clean.rds"
## [3] "acs_b15003_2020_county_clean.csv"
## [4] "acs_b15003_2020_county_clean.rds"
## [5] "acs_b19013_2020_county_clean.csv"
## [6] "acs_b19013_2020_county_clean.rds"
## [7] "acs_b28002_2020_county_clean.csv"
## [8] "acs_b28002_2020_county_clean.rds"
## [9] "airband_2020_county_clean.csv"
## [10] "airband_2020_county_clean.rds"
## [11] "airband_inconsistency_counties.csv"
## [12] "analysis_2020_county_final_sf.rds"
## [13] "analysis_2020_county_final.csv"
## [14] "analysis_2020_county_final.rds"
## [15] "analysis_2020_county_sf.rds"
## [16] "analysis_2020_county.csv"
## [17] "analysis_2020_county.rds"
## [18] "counties_sf.rds"
## [19] "fcc_2020_dec_clean.csv"
## [20] "fcc_2020_dec_clean.rds"
## [21] "fcc_2025_may_clean.csv"
## [22] "fcc_2025_may_clean.rds"
## [23] "master_2020_county_sf.rds"
## [24] "master_2020_county.csv"
## [25] "master_2020_county.rds"
## [26] "merged_fcc_svi_2020.csv"
## [27] "merged_fcc_svi_2020.rds"
## [28] "merged_fcc_svi_airband_2020.csv"
## [29] "merged_fcc_svi_airband_2020.rds"
## [30] "merged_fcc_svi_airband_ookla_2020.csv"
## [31] "merged_fcc_svi_airband_ookla_2020.rds"
## [32] "merged_master_2020.csv"
## [33] "merged_master_2020.rds"
## [34] "ookla_2020_county_clean.csv"
## [35] "ookla_2020_county_clean.rds"
## [36] "svi_2020_county_clean.csv"
## [37] "svi_2020_county_clean.rds"
## [38] "usda_rucc_2023_clean.csv"
## [39] "usda_rucc_2023_clean.rds"
Section 9 creates a single combined dataset by merging:
FCC broadband availability
SVI (overall + 4 themes)
Microsoft Airband broadband usage
ACS (internet, computer access, income, education)
library(dplyr)
library(readr)
# Load FCC + SVI + Airband merged dataset
merged_fcc_svi_airband <- readRDS("processed_data/merged_fcc_svi_airband_2020.rds")
# Load merged ACS dataset
acs_2020_all <- readRDS("processed_data/acs_all_2020_county_clean.rds")
# Quick preview
head(merged_fcc_svi_airband)
## # A tibble: 6 × 34
## county_fips state_name county_name.x housing_units tier1 tier2 tier3
## <chr> <chr> <chr> <dbl> <int> <int> <int>
## 1 01001 AL Autauga County 27544 4 4 4
## 2 01003 AL Baldwin County 137072 4 2 2
## 3 01005 AL Barbour County 15445 4 2 2
## 4 01007 AL Bibb County 11168 4 1 1
## 5 01009 AL Blount County 31265 4 1 1
## 6 01011 AL Bullock County 6136 4 3 1
## # ℹ 27 more variables: coverage_25_3 <dbl>, coverage_100_20 <dbl>,
## # coverage_250_25 <dbl>, coverage_1000_100 <dbl>, state.x <chr>,
## # county <chr>, svi_overall <dbl>, svi_soc <dbl>, svi_hh <dbl>,
## # svi_min <dbl>, svi_hous <dbl>, state.y <chr>, county_name.y <chr>,
## # airband_fcc_availability <chr>, airband_usage <chr>,
## # avg_download_mbps <dbl>, avg_upload_mbps <dbl>, avg_latency_ms <dbl>,
## # median_download_mbps <dbl>, median_upload_mbps <dbl>, total_tests <int>, …
head(acs_2020_all)
## # A tibble: 6 × 13
## FIPS GEO_ID NAME internet_total_hh internet_broadband internet_no_access
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 01001 0500000US… Auta… 21559 17850 402
## 2 01003 0500000US… Bald… 84047 71880 2149
## 3 01005 0500000US… Barb… 9322 6059 563
## 4 01007 0500000US… Bibb… 7259 5529 210
## 5 01009 0500000US… Blou… 21205 16971 314
## 6 01011 0500000US… Bull… 3429 2153 236
## # ℹ 7 more variables: computer_no_device <dbl>, income_median <dbl>,
## # edu_total_25plus <dbl>, edu_hs <dbl>, edu_bach <dbl>, edu_mast <dbl>,
## # edu_doc <dbl>
merged_fcc_svi_airband <- readRDS("processed_data/merged_fcc_svi_airband_2020.rds")
Merging is simple: Left join so that FIPS from FCC/SVI/Airband stays the base.
# 9.2 Create Master 2020 County Dataset
# 9.2.1 Ensure common key name: county_fips -> FIPS
merged_fcc_svi_airband <- merged_fcc_svi_airband %>%
dplyr::rename(FIPS = county_fips)
# 9.2.2 Merge FCC + SVI + Airband with ACS (by FIPS)
master_2020 <- merged_fcc_svi_airband %>%
dplyr::left_join(acs_2020_all, by = "FIPS")
# 9.2.3 Convert Airband columns to numeric (if they are stored as character)
master_2020 <- master_2020 %>%
dplyr::mutate(
airband_fcc_availability = as.numeric(airband_fcc_availability),
airband_usage = as.numeric(airband_usage)
)
# 9.2.4 Clean up duplicate state/county columns and keep a tidy set
master_2020 <- master_2020 %>%
dplyr::select(
FIPS,
state_name,
county_name = county_name.y, # final county name
housing_units,
tier1, tier2, tier3,
svi_overall, svi_soc, svi_hh, svi_min, svi_hous,
airband_fcc_availability,
airband_usage,
GEO_ID, NAME,
internet_total_hh,
internet_broadband,
internet_no_access,
computer_no_device,
income_median,
edu_total_25plus,
edu_hs,
edu_bach,
edu_mast,
edu_doc
)
# 9.2.5 Quick structural check
glimpse(master_2020)
## Rows: 3,141
## Columns: 26
## $ FIPS <chr> "01001", "01003", "01005", "01007", "01009", …
## $ state_name <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL…
## $ county_name <chr> "Autauga County", "Baldwin County", "Barbour …
## $ housing_units <dbl> 27544, 137072, 15445, 11168, 31265, 6136, 128…
## $ tier1 <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ tier2 <int> 4, 2, 2, 1, 1, 3, 2, 4, 3, 1, 1, 1, 1, 1, 1, …
## $ tier3 <int> 4, 2, 2, 1, 1, 1, 2, 4, 3, 1, 1, 1, 1, 1, 1, …
## $ svi_overall <dbl> 0.5130, 0.3103, 0.9927, 0.8078, 0.5137, 0.831…
## $ svi_soc <dbl> 0.3838, 0.3253, 0.9567, 0.8008, 0.5757, 0.718…
## $ svi_hh <dbl> 0.7362, 0.2724, 0.9453, 0.5512, 0.7225, 0.651…
## $ svi_min <dbl> 0.6337, 0.5022, 0.8962, 0.6292, 0.4147, 0.980…
## $ svi_hous <dbl> 0.4309, 0.3612, 0.9949, 0.8622, 0.2743, 0.850…
## $ airband_fcc_availability <dbl> 0.8057, 0.8362, 0.6891, 0.3368, 0.7580, 0.936…
## $ airband_usage <dbl> 0.391, 0.452, 0.324, 0.136, 0.199, 0.157, 0.1…
## $ GEO_ID <chr> "0500000US01001", "0500000US01003", "0500000U…
## $ NAME <chr> "Autauga County, Alabama", "Baldwin County, A…
## $ internet_total_hh <dbl> 21559, 84047, 9322, 7259, 21205, 3429, 6649, …
## $ internet_broadband <dbl> 17850, 71880, 6059, 5529, 16971, 2153, 4941, …
## $ internet_no_access <dbl> 402, 2149, 563, 210, 314, 236, 150, 931, 284,…
## $ computer_no_device <dbl> 3307, 10018, 2700, 1520, 3920, 1040, 1558, 79…
## $ income_median <dbl> 57982, 61756, 34990, 51721, 48922, 33866, 448…
## $ edu_total_25plus <dbl> 37860, 155563, 17797, 15987, 39814, 6784, 138…
## $ edu_hs <dbl> 9663, 34521, 5171, 5879, 10582, 2498, 5583, 2…
## $ edu_bach <dbl> 6320, 31444, 1296, 1183, 3540, 413, 1396, 829…
## $ edu_mast <dbl> 3399, 13331, 540, 481, 1382, 152, 651, 4618, …
## $ edu_doc <dbl> 498, 2016, 113, 41, 132, 28, 19, 883, 29, 132…
head(master_2020)
## # A tibble: 6 × 26
## FIPS state_name county_name housing_units tier1 tier2 tier3 svi_overall
## <chr> <chr> <chr> <dbl> <int> <int> <int> <dbl>
## 1 01001 AL Autauga County 27544 4 4 4 0.513
## 2 01003 AL Baldwin County 137072 4 2 2 0.310
## 3 01005 AL Barbour County 15445 4 2 2 0.993
## 4 01007 AL Bibb County 11168 4 1 1 0.808
## 5 01009 AL Blount County 31265 4 1 1 0.514
## 6 01011 AL Bullock County 6136 4 3 1 0.831
## # ℹ 18 more variables: svi_soc <dbl>, svi_hh <dbl>, svi_min <dbl>,
## # svi_hous <dbl>, airband_fcc_availability <dbl>, airband_usage <dbl>,
## # GEO_ID <chr>, NAME <chr>, internet_total_hh <dbl>,
## # internet_broadband <dbl>, internet_no_access <dbl>,
## # computer_no_device <dbl>, income_median <dbl>, edu_total_25plus <dbl>,
## # edu_hs <dbl>, edu_bach <dbl>, edu_mast <dbl>, edu_doc <dbl>
# 9.3 Save Final Master 2020 County Dataset
write_csv(master_2020, "processed_data/master_2020_county.csv")
saveRDS(master_2020, "processed_data/master_2020_county.rds")
# Check that it's there
list.files("processed_data")
## [1] "acs_all_2020_county_clean.csv"
## [2] "acs_all_2020_county_clean.rds"
## [3] "acs_b15003_2020_county_clean.csv"
## [4] "acs_b15003_2020_county_clean.rds"
## [5] "acs_b19013_2020_county_clean.csv"
## [6] "acs_b19013_2020_county_clean.rds"
## [7] "acs_b28002_2020_county_clean.csv"
## [8] "acs_b28002_2020_county_clean.rds"
## [9] "airband_2020_county_clean.csv"
## [10] "airband_2020_county_clean.rds"
## [11] "airband_inconsistency_counties.csv"
## [12] "analysis_2020_county_final_sf.rds"
## [13] "analysis_2020_county_final.csv"
## [14] "analysis_2020_county_final.rds"
## [15] "analysis_2020_county_sf.rds"
## [16] "analysis_2020_county.csv"
## [17] "analysis_2020_county.rds"
## [18] "counties_sf.rds"
## [19] "fcc_2020_dec_clean.csv"
## [20] "fcc_2020_dec_clean.rds"
## [21] "fcc_2025_may_clean.csv"
## [22] "fcc_2025_may_clean.rds"
## [23] "master_2020_county_sf.rds"
## [24] "master_2020_county.csv"
## [25] "master_2020_county.rds"
## [26] "merged_fcc_svi_2020.csv"
## [27] "merged_fcc_svi_2020.rds"
## [28] "merged_fcc_svi_airband_2020.csv"
## [29] "merged_fcc_svi_airband_2020.rds"
## [30] "merged_fcc_svi_airband_ookla_2020.csv"
## [31] "merged_fcc_svi_airband_ookla_2020.rds"
## [32] "merged_master_2020.csv"
## [33] "merged_master_2020.rds"
## [34] "ookla_2020_county_clean.csv"
## [35] "ookla_2020_county_clean.rds"
## [36] "svi_2020_county_clean.csv"
## [37] "svi_2020_county_clean.rds"
## [38] "usda_rucc_2023_clean.csv"
## [39] "usda_rucc_2023_clean.rds"
The goal of this section is to combine:
counties_sf.rds → county boundaries (shapefile processed in Section 7)
master_2020_county.rds → all your attributes (FCC, SVI, Airband, ACS)
into a single sf object:
master_2020_sf → one row per county, with both attributes and geometry.
We’ll join them using:
GEOID from the shapefile
FIPS from master_2020
library(sf)
library(dplyr)
# 10.1 Load processed county shapefile (sf object)
counties_sf <- readRDS("processed_data/counties_sf.rds")
# 10.1 Load non-spatial master dataset
master_2020 <- readRDS("processed_data/master_2020_county.rds")
# Quick checks
counties_sf
## Simple feature collection with 3234 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
## Geodetic CRS: WGS 84
## First 10 features:
## GEOID NAME STATEFP COUNTYFP geometry
## 1 31039 Cuming 31 039 MULTIPOLYGON (((-97.01952 4...
## 2 53069 Wahkiakum 53 069 MULTIPOLYGON (((-123.4364 4...
## 3 35011 De Baca 35 011 MULTIPOLYGON (((-104.5674 3...
## 4 31109 Lancaster 31 109 MULTIPOLYGON (((-96.91075 4...
## 5 31129 Nuckolls 31 129 MULTIPOLYGON (((-98.27367 4...
## 6 72085 Las Piedras 72 085 MULTIPOLYGON (((-65.91048 1...
## 7 46099 Minnehaha 46 099 MULTIPOLYGON (((-96.89022 4...
## 8 48327 Menard 48 327 MULTIPOLYGON (((-99.82187 3...
## 9 06091 Sierra 06 091 MULTIPOLYGON (((-120.6556 3...
## 10 21053 Clinton 21 053 MULTIPOLYGON (((-85.2391 36...
glimpse(master_2020)
## Rows: 3,141
## Columns: 26
## $ FIPS <chr> "01001", "01003", "01005", "01007", "01009", …
## $ state_name <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL…
## $ county_name <chr> "Autauga County", "Baldwin County", "Barbour …
## $ housing_units <dbl> 27544, 137072, 15445, 11168, 31265, 6136, 128…
## $ tier1 <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ tier2 <int> 4, 2, 2, 1, 1, 3, 2, 4, 3, 1, 1, 1, 1, 1, 1, …
## $ tier3 <int> 4, 2, 2, 1, 1, 1, 2, 4, 3, 1, 1, 1, 1, 1, 1, …
## $ svi_overall <dbl> 0.5130, 0.3103, 0.9927, 0.8078, 0.5137, 0.831…
## $ svi_soc <dbl> 0.3838, 0.3253, 0.9567, 0.8008, 0.5757, 0.718…
## $ svi_hh <dbl> 0.7362, 0.2724, 0.9453, 0.5512, 0.7225, 0.651…
## $ svi_min <dbl> 0.6337, 0.5022, 0.8962, 0.6292, 0.4147, 0.980…
## $ svi_hous <dbl> 0.4309, 0.3612, 0.9949, 0.8622, 0.2743, 0.850…
## $ airband_fcc_availability <dbl> 0.8057, 0.8362, 0.6891, 0.3368, 0.7580, 0.936…
## $ airband_usage <dbl> 0.391, 0.452, 0.324, 0.136, 0.199, 0.157, 0.1…
## $ GEO_ID <chr> "0500000US01001", "0500000US01003", "0500000U…
## $ NAME <chr> "Autauga County, Alabama", "Baldwin County, A…
## $ internet_total_hh <dbl> 21559, 84047, 9322, 7259, 21205, 3429, 6649, …
## $ internet_broadband <dbl> 17850, 71880, 6059, 5529, 16971, 2153, 4941, …
## $ internet_no_access <dbl> 402, 2149, 563, 210, 314, 236, 150, 931, 284,…
## $ computer_no_device <dbl> 3307, 10018, 2700, 1520, 3920, 1040, 1558, 79…
## $ income_median <dbl> 57982, 61756, 34990, 51721, 48922, 33866, 448…
## $ edu_total_25plus <dbl> 37860, 155563, 17797, 15987, 39814, 6784, 138…
## $ edu_hs <dbl> 9663, 34521, 5171, 5879, 10582, 2498, 5583, 2…
## $ edu_bach <dbl> 6320, 31444, 1296, 1183, 3540, 413, 1396, 829…
## $ edu_mast <dbl> 3399, 13331, 540, 481, 1382, 152, 651, 4618, …
## $ edu_doc <dbl> 498, 2016, 113, 41, 132, 28, 19, 883, 29, 132…
To keep the geometry, we start from counties_sf and
left_join() the master_2020 table into it.
# 10.2 Join master data to county polygons
# GEOID (shapefile) matches FIPS (master table)
master_2020_sf <- counties_sf %>%
left_join(master_2020, by = c("GEOID" = "FIPS"))
# Check result: should be an sf object with all master_2020 columns + geometry
master_2020_sf
## Simple feature collection with 3234 features and 29 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
## Geodetic CRS: WGS 84
## First 10 features:
## GEOID NAME.x STATEFP COUNTYFP state_name county_name housing_units
## 1 31039 Cuming 31 039 NE Cuming County 5494
## 2 53069 Wahkiakum 53 069 WA Wahkiakum County 2790
## 3 35011 De Baca 35 011 NM De Baca County 1531
## 4 31109 Lancaster 31 109 NE Lancaster County 148105
## 5 31129 Nuckolls 31 129 NE Nuckolls County 3396
## 6 72085 Las Piedras 72 085 <NA> <NA> NA
## 7 46099 Minnehaha 46 099 SD Minnehaha County 96413
## 8 48327 Menard 48 327 TX Menard County 2689
## 9 06091 Sierra 06 091 CA Sierra County 2581
## 10 21053 Clinton 21 053 KY Clinton County 6151
## tier1 tier2 tier3 svi_overall svi_soc svi_hh svi_min svi_hous
## 1 4 4 3 0.1610 0.1101 0.5738 0.3762 0.1439
## 2 4 2 2 0.1706 0.2314 0.2167 0.4411 0.1467
## 3 4 4 3 0.5891 0.7782 0.2447 0.9621 0.2775
## 4 4 4 4 0.3781 0.2629 0.2412 0.5407 0.6785
## 5 4 4 2 0.0707 0.0904 0.2371 0.1359 0.1248
## 6 NA NA NA NA NA NA NA NA
## 7 4 4 4 0.4013 0.1677 0.5484 0.5242 0.7072
## 8 4 3 1 0.7702 0.6680 0.8211 0.7855 0.6292
## 9 4 1 1 0.2810 0.5325 0.3501 0.3682 0.0792
## 10 4 2 2 0.7600 0.8291 0.7358 0.0907 0.7728
## airband_fcc_availability airband_usage GEO_ID
## 1 0.8900 0.284 0500000US31039
## 2 0.6246 0.273 0500000US53069
## 3 0.8124 0.455 0500000US35011
## 4 1.0000 0.677 0500000US31109
## 5 0.7406 0.429 0500000US31129
## 6 NA NA <NA>
## 7 0.9855 0.692 0500000US46099
## 8 0.9097 0.050 0500000US48327
## 9 0.6599 0.077 0500000US06091
## 10 0.9208 0.129 0500000US21053
## NAME.y internet_total_hh internet_broadband
## 1 Cuming County, Nebraska 3854 2817
## 2 Wahkiakum County, Washington 1900 1551
## 3 De Baca County, New Mexico 554 381
## 4 Lancaster County, Nebraska 126666 113669
## 5 Nuckolls County, Nebraska 1852 1415
## 6 <NA> NA NA
## 7 Minnehaha County, South Dakota 78453 69385
## 8 Menard County, Texas 1035 662
## 9 Sierra County, California 1250 946
## 10 Clinton County, Kentucky 4023 2472
## internet_no_access computer_no_device income_median edu_total_25plus edu_hs
## 1 208 829 59202 6071 2103
## 2 83 266 54524 3356 719
## 3 12 161 31532 941 248
## 4 3450 9547 62464 196706 34338
## 5 37 400 52975 3060 857
## 6 NA NA NA NA NA
## 7 2515 6553 63699 126191 28726
## 8 95 278 43826 1670 335
## 9 74 230 52103 2302 559
## 10 61 1490 33092 7209 2450
## edu_bach edu_mast edu_doc geometry
## 1 1026 259 15 MULTIPOLYGON (((-97.01952 4...
## 2 389 253 20 MULTIPOLYGON (((-123.4364 4...
## 3 63 62 0 MULTIPOLYGON (((-104.5674 3...
## 4 48821 19426 5242 MULTIPOLYGON (((-96.91075 4...
## 5 499 153 48 MULTIPOLYGON (((-98.27367 4...
## 6 NA NA NA MULTIPOLYGON (((-65.91048 1...
## 7 29293 9810 1293 MULTIPOLYGON (((-96.89022 4...
## 8 253 114 33 MULTIPOLYGON (((-99.82187 3...
## 9 202 164 9 MULTIPOLYGON (((-120.6556 3...
## 10 494 255 80 MULTIPOLYGON (((-85.2391 36...
sf::st_crs(master_2020_sf)
## Coordinate Reference System:
## User input: EPSG:4326
## wkt:
## GEOGCRS["WGS 84",
## ENSEMBLE["World Geodetic System 1984 ensemble",
## MEMBER["World Geodetic System 1984 (Transit)"],
## MEMBER["World Geodetic System 1984 (G730)"],
## MEMBER["World Geodetic System 1984 (G873)"],
## MEMBER["World Geodetic System 1984 (G1150)"],
## MEMBER["World Geodetic System 1984 (G1674)"],
## MEMBER["World Geodetic System 1984 (G1762)"],
## MEMBER["World Geodetic System 1984 (G2139)"],
## MEMBER["World Geodetic System 1984 (G2296)"],
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]],
## ENSEMBLEACCURACY[2.0]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["geodetic latitude (Lat)",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["geodetic longitude (Lon)",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## USAGE[
## SCOPE["Horizontal component of 3D system."],
## AREA["World."],
## BBOX[-90,-180,90,180]],
## ID["EPSG",4326]]
glimpse(master_2020_sf)
## Rows: 3,234
## Columns: 30
## $ GEOID <chr> "31039", "53069", "35011", "31109", "31129", …
## $ NAME.x <chr> "Cuming", "Wahkiakum", "De Baca", "Lancaster"…
## $ STATEFP <chr> "31", "53", "35", "31", "31", "72", "46", "48…
## $ COUNTYFP <chr> "039", "069", "011", "109", "129", "085", "09…
## $ state_name <chr> "NE", "WA", "NM", "NE", "NE", NA, "SD", "TX",…
## $ county_name <chr> "Cuming County", "Wahkiakum County", "De Baca…
## $ housing_units <dbl> 5494, 2790, 1531, 148105, 3396, NA, 96413, 26…
## $ tier1 <int> 4, 4, 4, 4, 4, NA, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tier2 <int> 4, 2, 4, 4, 4, NA, 4, 3, 1, 2, 4, 4, 1, 4, 4,…
## $ tier3 <int> 3, 2, 3, 4, 2, NA, 4, 1, 1, 2, 4, 3, 1, 1, 3,…
## $ svi_overall <dbl> 0.1610, 0.1706, 0.5891, 0.3781, 0.0707, NA, 0…
## $ svi_soc <dbl> 0.1101, 0.2314, 0.7782, 0.2629, 0.0904, NA, 0…
## $ svi_hh <dbl> 0.5738, 0.2167, 0.2447, 0.2412, 0.2371, NA, 0…
## $ svi_min <dbl> 0.3762, 0.4411, 0.9621, 0.5407, 0.1359, NA, 0…
## $ svi_hous <dbl> 0.1439, 0.1467, 0.2775, 0.6785, 0.1248, NA, 0…
## $ airband_fcc_availability <dbl> 0.8900, 0.6246, 0.8124, 1.0000, 0.7406, NA, 0…
## $ airband_usage <dbl> 0.284, 0.273, 0.455, 0.677, 0.429, NA, 0.692,…
## $ GEO_ID <chr> "0500000US31039", "0500000US53069", "0500000U…
## $ NAME.y <chr> "Cuming County, Nebraska", "Wahkiakum County,…
## $ internet_total_hh <dbl> 3854, 1900, 554, 126666, 1852, NA, 78453, 103…
## $ internet_broadband <dbl> 2817, 1551, 381, 113669, 1415, NA, 69385, 662…
## $ internet_no_access <dbl> 208, 83, 12, 3450, 37, NA, 2515, 95, 74, 61, …
## $ computer_no_device <dbl> 829, 266, 161, 9547, 400, NA, 6553, 278, 230,…
## $ income_median <dbl> 59202, 54524, 31532, 62464, 52975, NA, 63699,…
## $ edu_total_25plus <dbl> 6071, 3356, 941, 196706, 3060, NA, 126191, 16…
## $ edu_hs <dbl> 2103, 719, 248, 34338, 857, NA, 28726, 335, 5…
## $ edu_bach <dbl> 1026, 389, 63, 48821, 499, NA, 29293, 253, 20…
## $ edu_mast <dbl> 259, 253, 62, 19426, 153, NA, 9810, 114, 164,…
## $ edu_doc <dbl> 15, 20, 0, 5242, 48, NA, 1293, 33, 9, 80, 473…
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-97.01952 4..., …
# At this point, you should have created this object:
# master_2020_sf <- left_join(counties_sf, master_2020, by = "GEOID")
# Check structure
glimpse(master_2020_sf)
## Rows: 3,234
## Columns: 30
## $ GEOID <chr> "31039", "53069", "35011", "31109", "31129", …
## $ NAME.x <chr> "Cuming", "Wahkiakum", "De Baca", "Lancaster"…
## $ STATEFP <chr> "31", "53", "35", "31", "31", "72", "46", "48…
## $ COUNTYFP <chr> "039", "069", "011", "109", "129", "085", "09…
## $ state_name <chr> "NE", "WA", "NM", "NE", "NE", NA, "SD", "TX",…
## $ county_name <chr> "Cuming County", "Wahkiakum County", "De Baca…
## $ housing_units <dbl> 5494, 2790, 1531, 148105, 3396, NA, 96413, 26…
## $ tier1 <int> 4, 4, 4, 4, 4, NA, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tier2 <int> 4, 2, 4, 4, 4, NA, 4, 3, 1, 2, 4, 4, 1, 4, 4,…
## $ tier3 <int> 3, 2, 3, 4, 2, NA, 4, 1, 1, 2, 4, 3, 1, 1, 3,…
## $ svi_overall <dbl> 0.1610, 0.1706, 0.5891, 0.3781, 0.0707, NA, 0…
## $ svi_soc <dbl> 0.1101, 0.2314, 0.7782, 0.2629, 0.0904, NA, 0…
## $ svi_hh <dbl> 0.5738, 0.2167, 0.2447, 0.2412, 0.2371, NA, 0…
## $ svi_min <dbl> 0.3762, 0.4411, 0.9621, 0.5407, 0.1359, NA, 0…
## $ svi_hous <dbl> 0.1439, 0.1467, 0.2775, 0.6785, 0.1248, NA, 0…
## $ airband_fcc_availability <dbl> 0.8900, 0.6246, 0.8124, 1.0000, 0.7406, NA, 0…
## $ airband_usage <dbl> 0.284, 0.273, 0.455, 0.677, 0.429, NA, 0.692,…
## $ GEO_ID <chr> "0500000US31039", "0500000US53069", "0500000U…
## $ NAME.y <chr> "Cuming County, Nebraska", "Wahkiakum County,…
## $ internet_total_hh <dbl> 3854, 1900, 554, 126666, 1852, NA, 78453, 103…
## $ internet_broadband <dbl> 2817, 1551, 381, 113669, 1415, NA, 69385, 662…
## $ internet_no_access <dbl> 208, 83, 12, 3450, 37, NA, 2515, 95, 74, 61, …
## $ computer_no_device <dbl> 829, 266, 161, 9547, 400, NA, 6553, 278, 230,…
## $ income_median <dbl> 59202, 54524, 31532, 62464, 52975, NA, 63699,…
## $ edu_total_25plus <dbl> 6071, 3356, 941, 196706, 3060, NA, 126191, 16…
## $ edu_hs <dbl> 2103, 719, 248, 34338, 857, NA, 28726, 335, 5…
## $ edu_bach <dbl> 1026, 389, 63, 48821, 499, NA, 29293, 253, 20…
## $ edu_mast <dbl> 259, 253, 62, 19426, 153, NA, 9810, 114, 164,…
## $ edu_doc <dbl> 15, 20, 0, 5242, 48, NA, 1293, 33, 9, 80, 473…
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-97.01952 4..., …
# Save spatial master dataset
saveRDS(master_2020_sf, "processed_data/master_2020_county_sf.rds")
# Confirm saved file
list.files("processed_data", pattern = "sf")
## [1] "analysis_2020_county_final_sf.rds" "analysis_2020_county_sf.rds"
## [3] "counties_sf.rds" "master_2020_county_sf.rds"
The final analysis dataset was created by filtering the spatial master table to retain only counties with complete data across key broadband and vulnerability indicators, including internet households, median income, Airband availability and usage, and SVI overall score. This reduced the dataset from 3,234 county-equivalents in the TIGER/Line shapefile to 3,120 valid U.S. counties with fully complete records. These 3,120 counties form the final, clean analytical base for clustering, spatial joins, and digital divide analysis.
library(dplyr)
library(sf)
# 1) Load the spatial master dataset from Section 10
master_2020_sf <- readRDS("processed_data/master_2020_county_sf.rds")
# 2) Keep only counties with complete data in key fields
analysis_2020_sf <- master_2020_sf %>%
filter(
!is.na(internet_total_hh),
!is.na(income_median),
!is.na(airband_fcc_availability),
!is.na(airband_usage),
!is.na(svi_overall)
)
# How many counties remain?
nrow(master_2020_sf) # before
## [1] 3234
nrow(analysis_2020_sf) # after
## [1] 3120
# 3) Create a non-spatial version (for clustering, regressions, etc.)
analysis_2020 <- analysis_2020_sf %>%
st_drop_geometry()
# 4) Save final analysis datasets
saveRDS(analysis_2020_sf, "processed_data/analysis_2020_county_sf.rds")
saveRDS(analysis_2020, "processed_data/analysis_2020_county.rds")
readr::write_csv(analysis_2020, "processed_data/analysis_2020_county.csv")
# 5) Confirm files
list.files("processed_data", pattern = "analysis_2020")
## [1] "analysis_2020_county_final_sf.rds" "analysis_2020_county_final.csv"
## [3] "analysis_2020_county_final.rds" "analysis_2020_county_sf.rds"
## [5] "analysis_2020_county.csv" "analysis_2020_county.rds"
Several internal validation checks were performed to ensure coherence across the merged dataset. Broadband subscription counts were verified to never exceed total household counts, and all ACS-derived broadband and device-access percentages fell within the valid range of 0–100%. A correlation analysis between median household income and the SVI socioeconomic theme produced the expected negative relationship, confirming consistency between ACS demographics and CDC social vulnerability metrics. Some counties displayed Airband usage values greater than modeled availability—this is a known feature of Airband’s methodology and does not represent data errors. Overall, no logical inconsistencies were detected in the final analysis dataset.
library(dplyr)
# Load final non-spatial dataset
analysis_2020 <- readRDS("processed_data/analysis_2020_county.rds")
# 1. Logical Broadband Checks
# Broadband subscriptions cannot exceed households
invalid_broadband_logic <- analysis_2020 %>%
filter(
internet_broadband > internet_total_hh |
internet_no_access > internet_total_hh
)
invalid_broadband_logic # ideally 0 rows
## [1] GEOID NAME.x STATEFP
## [4] COUNTYFP state_name county_name
## [7] housing_units tier1 tier2
## [10] tier3 svi_overall svi_soc
## [13] svi_hh svi_min svi_hous
## [16] airband_fcc_availability airband_usage GEO_ID
## [19] NAME.y internet_total_hh internet_broadband
## [22] internet_no_access computer_no_device income_median
## [25] edu_total_25plus edu_hs edu_bach
## [28] edu_mast edu_doc
## <0 rows> (or 0-length row.names)
# 2. ACS Percentage Validity
acs_pct_check <- analysis_2020 %>%
mutate(
pct_broadband = internet_broadband / internet_total_hh * 100,
pct_no_access = internet_no_access / internet_total_hh * 100,
pct_no_device = computer_no_device / internet_total_hh * 100
)
invalid_pct <- acs_pct_check %>%
filter(
pct_broadband < 0 | pct_broadband > 100 |
pct_no_access < 0 | pct_no_access > 100 |
pct_no_device < 0 | pct_no_device > 100
)
invalid_pct # ideally 0 rows
## [1] GEOID NAME.x STATEFP
## [4] COUNTYFP state_name county_name
## [7] housing_units tier1 tier2
## [10] tier3 svi_overall svi_soc
## [13] svi_hh svi_min svi_hous
## [16] airband_fcc_availability airband_usage GEO_ID
## [19] NAME.y internet_total_hh internet_broadband
## [22] internet_no_access computer_no_device income_median
## [25] edu_total_25plus edu_hs edu_bach
## [28] edu_mast edu_doc pct_broadband
## [31] pct_no_access pct_no_device
## <0 rows> (or 0-length row.names)
# 3. Income vs SVI Socioeconomic Relationship
cor_income_svi <- cor(
analysis_2020$income_median,
analysis_2020$svi_soc,
use = "complete.obs"
)
cor_income_svi
## [1] -0.6171082
# 4. Airband Logical Check:
airband_inconsistency <- analysis_2020 %>%
filter(
airband_usage > airband_fcc_availability
)
nrow(airband_inconsistency)
## [1] 50
head(airband_inconsistency)
## GEOID NAME.x STATEFP COUNTYFP state_name county_name
## 1 22107 Tensas 22 107 LA Tensas Parish
## 2 02188 Northwest Arctic 02 188 AK Northwest Arctic Borough
## 3 01105 Perry 01 105 AL Perry County
## 4 20151 Pratt 20 151 KS Pratt County
## 5 01063 Greene 01 063 AL Greene County
## 6 28055 Issaquena 28 055 MS Issaquena County
## housing_units tier1 tier2 tier3 svi_overall svi_soc svi_hh svi_min svi_hous
## 1 3734 4 1 1 0.7234 0.9093 0.1735 0.9144 0.4930
## 2 3329 4 4 1 0.9153 0.8915 0.4376 0.9933 0.9749
## 3 5751 4 1 1 0.9137 0.9803 0.5732 0.9653 0.7314
## 4 5644 4 4 3 0.3877 0.2002 0.7180 0.3819 0.5156
## 5 5979 4 1 1 0.9513 0.9885 0.5191 0.9850 0.8908
## 6 1487 4 1 1 0.9847 0.9634 0.4379 0.9297 1.0000
## airband_fcc_availability airband_usage GEO_ID
## 1 0.0305 0.054 0500000US22107
## 2 0.0081 0.039 0500000US02188
## 3 0.0012 0.032 0500000US01105
## 4 0.7943 0.884 0500000US20151
## 5 0.0085 0.013 0500000US01063
## 6 0.0173 0.145 0500000US28055
## NAME.y internet_total_hh internet_broadband
## 1 Tensas Parish, Louisiana 1814 998
## 2 Northwest Arctic Borough, Alaska 1795 1407
## 3 Perry County, Alabama 3140 1738
## 4 Pratt County, Kansas 3733 3000
## 5 Greene County, Alabama 3178 1768
## 6 Issaquena County, Mississippi 462 154
## internet_no_access computer_no_device income_median edu_total_25plus edu_hs
## 1 86 730 29767 3135 1172
## 2 21 367 63750 4251 1663
## 3 53 1349 23875 5723 1882
## 4 158 575 54644 6042 1240
## 5 152 1258 26688 5768 2131
## 6 68 240 28333 1001 280
## edu_bach edu_mast edu_doc
## 1 469 91 19
## 2 354 213 14
## 3 661 272 71
## 4 1236 378 21
## 5 332 190 46
## 6 9 20 1
Fifty counties exhibited cases where broadband usage exceeded FCC-reported broadband availability. This is an expected pattern in Airband datasets due to differences in modeling assumptions: FCC availability represents provider-reported infrastructure, while Airband usage measures active connectivity across satellite, mobile, and other modalities. These counties were retained, as the differences reflect true measurement variation rather than data quality issues. ## 11.1.1 Table of the fifty counties
# Build a clean table for reporting
airband_inconsistency_table <- airband_inconsistency %>%
dplyr::select(
GEOID,
state_name,
county_name,
airband_fcc_availability,
airband_usage
) %>%
arrange(state_name, county_name)
# Display preview
head(airband_inconsistency_table)
## GEOID state_name county_name airband_fcc_availability
## 1 02185 AK North Slope Borough 0.0061
## 2 02188 AK Northwest Arctic Borough 0.0081
## 3 02198 AK Prince of Wales-Hyder Census Area 0.0282
## 4 02290 AK Yukon-Koyukuk Census Area 0.0130
## 5 01063 AL Greene County 0.0085
## 6 01105 AL Perry County 0.0012
## airband_usage
## 1 0.051
## 2 0.039
## 3 0.040
## 4 0.139
## 5 0.013
## 6 0.032
# Print full table
airband_inconsistency_table
## GEOID state_name county_name airband_fcc_availability
## 1 02185 AK North Slope Borough 0.0061
## 2 02188 AK Northwest Arctic Borough 0.0081
## 3 02198 AK Prince of Wales-Hyder Census Area 0.0282
## 4 02290 AK Yukon-Koyukuk Census Area 0.0130
## 5 01063 AL Greene County 0.0085
## 6 01105 AL Perry County 0.0012
## 7 05013 AR Calhoun County 0.0852
## 8 05095 AR Monroe County 0.0557
## 9 12029 FL Dixie County 0.0078
## 10 13053 GA Chattahoochee County 0.5456
## 11 13101 GA Echols County 0.0002
## 12 13125 GA Glascock County 0.0101
## 13 13201 GA Miller County 0.1072
## 14 13259 GA Stewart County 0.0082
## 15 13301 GA Warren County 0.0019
## 16 15005 HI Kalawao County 0.1977
## 17 16069 ID Nez Perce County 0.8671
## 18 17003 IL Alexander County 0.0009
## 19 20119 KS Meade County 0.0010
## 20 20151 KS Pratt County 0.7943
## 21 22023 LA Cameron Parish 0.0184
## 22 22107 LA Tensas Parish 0.0305
## 23 26095 MI Luce County 0.0393
## 24 29017 MO Bollinger County 0.0161
## 25 29089 MO Howard County 0.1625
## 26 29153 MO Ozark County 0.0375
## 27 29157 MO Perry County 0.5334
## 28 28031 MS Covington County 0.0451
## 29 28055 MS Issaquena County 0.0173
## 30 28063 MS Jefferson County 0.0309
## 31 30033 MT Garfield County 0.4285
## 32 30055 MT McCone County 0.4946
## 33 30109 MT Wibaux County 0.0826
## 34 38053 ND McKenzie County 0.7279
## 35 35053 NM Socorro County 0.0453
## 36 32015 NV Lander County 0.0022
## 37 32033 NV White Pine County 0.0095
## 38 41021 OR Gilliam County 0.3797
## 39 45065 SC McCormick County 0.4012
## 40 48211 TX Hemphill County 0.0937
## 41 48271 TX Kinney County 0.0221
## 42 48301 TX Loving County 0.3018
## 43 49019 UT Grand County 0.5325
## 44 51570 VA Colonial Heights city 0.9872
## 45 51610 VA Falls Church city 0.9860
## 46 51081 VA Greensville County 0.2399
## 47 51683 VA Manassas city 0.9611
## 48 51820 VA Waynesboro city 0.9838
## 49 51840 VA Winchester city 0.9978
## 50 54061 WV Monongalia County 0.5361
## airband_usage
## 1 0.051
## 2 0.039
## 3 0.040
## 4 0.139
## 5 0.013
## 6 0.032
## 7 0.202
## 8 0.058
## 9 0.041
## 10 0.755
## 11 0.034
## 12 0.033
## 13 0.187
## 14 0.047
## 15 0.049
## 16 0.221
## 17 0.870
## 18 0.104
## 19 0.141
## 20 0.884
## 21 0.134
## 22 0.054
## 23 0.093
## 24 0.066
## 25 0.183
## 26 0.048
## 27 0.600
## 28 0.054
## 29 0.145
## 30 0.053
## 31 0.779
## 32 0.514
## 33 0.125
## 34 0.878
## 35 0.211
## 36 0.056
## 37 0.083
## 38 0.409
## 39 0.439
## 40 0.219
## 41 0.091
## 42 0.941
## 43 0.813
## 44 1.000
## 45 1.000
## 46 0.377
## 47 1.000
## 48 1.000
## 49 1.000
## 50 0.562
# Save CSV for documentation
readr::write_csv(
airband_inconsistency_table,
"processed_data/airband_inconsistency_counties.csv"
)
In this section, we detect outliers using two statistical approaches:
1. IQR Method – flags extreme values beyond the 1.5×IQR
rule
2. Z-Scores – flags values where |z| > 3 relative to
the mean
We apply both methods to all continuous variables and then examine the geographic distribution of outlier counties.
## -------------------------------
## 11.1 Identify continuous fields
## -------------------------------
continuous_vars <- c(
"internet_total_hh", "internet_broadband", "internet_no_access",
"computer_no_device", "income_median",
"edu_total_25plus", "edu_hs", "edu_bach", "edu_mast", "edu_doc",
"airband_fcc_availability", "airband_usage"
)
num_data <- analysis_2020 %>% dplyr::select(GEOID, county_name, state_name, all_of(continuous_vars))
## -------------------------------
## 11.2 IQR Outlier Detection
## -------------------------------
iqr_flag <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
(x < (Q1 - 1.5 * IQR)) | (x > (Q3 + 1.5 * IQR))
}
iqr_outliers <- num_data %>%
mutate(across(all_of(continuous_vars), iqr_flag))
iqr_summary <- data.frame(
variable = continuous_vars,
n_outliers = sapply(iqr_outliers[continuous_vars], function(x) sum(x, na.rm=TRUE))
)
iqr_summary
## variable n_outliers
## internet_total_hh internet_total_hh 418
## internet_broadband internet_broadband 434
## internet_no_access internet_no_access 393
## computer_no_device computer_no_device 323
## income_median income_median 126
## edu_total_25plus edu_total_25plus 429
## edu_hs edu_hs 379
## edu_bach edu_bach 480
## edu_mast edu_mast 466
## edu_doc edu_doc 480
## airband_fcc_availability airband_fcc_availability 206
## airband_usage airband_usage 0
## -------------------------------
## 11.3 Z-Score Outlier Detection
## -------------------------------
z_df <- num_data %>%
mutate(across(all_of(continuous_vars),
~ (.-mean(., na.rm=TRUE)) / sd(., na.rm=TRUE)))
z_outliers <- z_df %>%
mutate(across(all_of(continuous_vars), ~ abs(.) > 3))
z_summary <- data.frame(
variable = continuous_vars,
n_outliers = sapply(z_outliers[continuous_vars], function(x) sum(x, na.rm=TRUE))
)
z_summary
## variable n_outliers
## internet_total_hh internet_total_hh 44
## internet_broadband internet_broadband 46
## internet_no_access internet_no_access 41
## computer_no_device computer_no_device 37
## income_median income_median 52
## edu_total_25plus edu_total_25plus 38
## edu_hs edu_hs 39
## edu_bach edu_bach 47
## edu_mast edu_mast 54
## edu_doc edu_doc 47
## airband_fcc_availability airband_fcc_availability 72
## airband_usage airband_usage 0
## 11.4 Compile Outlier Counties (Geographic Context)
# A county is an outlier if flagged by either method
combined_flags <- (iqr_outliers[continuous_vars] | z_outliers[continuous_vars])
# Extract counties flagged at least once
outlier_counties <- num_data %>%
mutate(any_outlier = apply(combined_flags, 1, any)) %>%
filter(any_outlier == TRUE) %>%
dplyr::select(GEOID, county_name, state_name)
# Show first few
head(outlier_counties)
## GEOID county_name state_name
## 1 31109 Lancaster County NE
## 2 46099 Minnehaha County SD
## 3 01027 Clay County AL
## 4 05137 Stone County AR
## 5 42007 Beaver County PA
## 6 37037 Chatham County NC
Interpretation Summary
Large-population counties (e.g., Los Angeles, Cook, Harris) appear as
outliers due to extremely high total households, broadband counts, and
education levels.
- Very small counties (e.g., Loving TX, Kalawao HI, rural Alaska census
areas) produce outliers in the opposite direction because of very small
denominators.
- Broadband usage outliers cluster in: - Tribal regions (low broadband
adoption)
- Appalachia(low income & limited infrastructure)
- Major metros(very high availability and usage)
- Income outliers highlight: - Rich suburban counties in Virginia,
Maryland, California - Poor rural counties in Mississippi, Alabama, and
Arkansas
Feature engineering creates new variables that better capture digital divide conditions across U.S. counties. These engineered variables will be used in later analysis, clustering, and visualization.
analysis_2020 <- analysis_2020 %>%
mutate(
# Broadband gap: difference between "available" vs "actually used"
broadband_gap = airband_fcc_availability - airband_usage,
# Affordability index: inverse of income (lower income = higher affordability risk)
affordability_index = 1 / income_median,
# Technology mix placeholder (FCC county data does not include tech details)
technology_mix = NA,
# Composite score combining SVI + broadband usage deficit
digital_vulnerability_score =
(svi_overall * 0.5) +
((1 - airband_usage) * 0.5),
# Priority intervention: worst quartile in BOTH vulnerability & broadband_gap
priority_intervention_flag =
(ntile(digital_vulnerability_score, 4) == 4 &
ntile(broadband_gap, 4) == 4)
)
library(sf)
library(dplyr)
library(readr)
# Ensure processed_data exists
dir.create("processed_data", showWarnings = FALSE)
# 1) Save updated spatial dataset (with new derived variables)
saveRDS(analysis_2020_sf,
"processed_data/analysis_2020_county_final_sf.rds")
# 2) Create non-spatial version and save
analysis_2020 <- analysis_2020_sf %>% st_drop_geometry()
saveRDS(analysis_2020,
"processed_data/analysis_2020_county_final.rds")
write_csv(analysis_2020,
"processed_data/analysis_2020_county_final.csv")
# Confirm files
list.files("processed_data", pattern = "analysis_2020_county_final")
## [1] "analysis_2020_county_final_sf.rds" "analysis_2020_county_final.csv"
## [3] "analysis_2020_county_final.rds"
library(dplyr)
# Load the updated non-spatial dataset (with all derived variables)
analysis_2020 <- readRDS("processed_data/analysis_2020_county_final.rds")
# Now safely transform for modeling
analysis_2020 <- analysis_2020 %>%
mutate(
# Standardized versions (z-scores) for clustering
z_income = scale(income_median),
z_broadband_usage = scale(airband_usage),
z_svi = scale(svi_overall),
# Quartile categories for visualization
income_quartile = ntile(income_median, 4),
broadband_quartile = ntile(airband_usage, 4),
svi_quartile = ntile(svi_overall, 4),
# Log transformation for skewed variables
log_income = log(income_median),
log_total_hh = log(internet_total_hh)
)
# Compute centroids for mapping
county_centroids <- st_centroid(counties_sf)
We formally checked missingness across all variables:
missing_summary <- sapply(analysis_2020, function(x) sum(is.na(x)))
missing_summary
## GEOID NAME.x STATEFP
## 0 0 0
## COUNTYFP state_name county_name
## 0 0 0
## housing_units tier1 tier2
## 0 0 0
## tier3 svi_overall svi_soc
## 0 0 0
## svi_hh svi_min svi_hous
## 0 0 0
## airband_fcc_availability airband_usage GEO_ID
## 0 0 0
## NAME.y internet_total_hh internet_broadband
## 0 0 0
## internet_no_access computer_no_device income_median
## 0 0 0
## edu_total_25plus edu_hs edu_bach
## 0 0 0
## edu_mast edu_doc z_income
## 0 0 0
## z_broadband_usage z_svi income_quartile
## 0 0 0
## broadband_quartile svi_quartile log_income
## 0 0 0
## log_total_hh
## 0
Summary: - Almost all missing values originated from counties removed
earlier (AK, territory counties).
- Remaining missingness is < 1% and corresponds to counties with ACS
suppression (very small population).
Because the cleaned dataset contains 99% complete rows, we use:
# Completeness
completeness_rate <- 1 - mean(is.na(analysis_2020))
completeness_rate
## [1] 1
# Correlation check between similar metrics
correlation_income_svi <- cor(analysis_2020$income_median,
analysis_2020$svi_soc,
use="complete.obs")
correlation_income_svi
## [1] -0.6171082
library(dplyr)
# Load the final non-spatial dataset if not already loaded
if (!exists("analysis_2020")) {
analysis_2020 <- readRDS("processed_data/analysis_2020_county_final.rds")
}
var_names <- names(analysis_2020)
# -------------------------
# 1) Define metadata lookup
# -------------------------
desc <- c(
GEOID = "5-digit county FIPS code (STATEFP + COUNTYFP)",
NAME.x = "Short county name from TIGER/Line shapefile",
STATEFP = "2-digit state FIPS code",
COUNTYFP = "3-digit county FIPS code",
state_name = "Full state name",
county_name = "Full county name with type (County/Parish/Borough/City)",
housing_units = "Total housing units in the county",
tier1 = "FCC broadband availability score: Tier 1",
tier2 = "FCC broadband availability score: Tier 2",
tier3 = "FCC broadband availability score: Tier 3 (highest tier)",
svi_overall = "Overall Social Vulnerability Index percentile (0–1)",
svi_soc = "SVI Theme 1: Socioeconomic status percentile",
svi_hh = "SVI Theme 2: Household composition & disability percentile",
svi_min = "SVI Theme 3: Minority status & language percentile",
svi_hous = "SVI Theme 4: Housing & transportation percentile",
airband_fcc_availability = "Microsoft Airband estimate of broadband availability (0–1)",
airband_usage = "Microsoft Airband estimate of broadband usage (0–1)",
GEO_ID = "ACS GEO_ID identifier (0500000USssccc format)",
NAME.y = "ACS county name with state (e.g., 'Cuyahoga County, Ohio')",
internet_total_hh = "Total households with internet status reported in ACS",
internet_broadband = "Households with a broadband internet subscription (ACS)",
internet_no_access = "Households with no internet access at home (ACS)",
computer_no_device = "Households without any computing device (ACS)",
income_median = "Median household income in dollars (ACS B19013)",
edu_total_25plus = "Total population aged 25 and over (ACS B15003)",
edu_hs = "Adults 25+ with high school diploma or equivalent",
edu_bach = "Adults 25+ with a bachelor’s degree",
edu_mast = "Adults 25+ with a master’s degree",
edu_doc = "Adults 25+ with a doctoral degree"
)
src <- c(
GEOID = "TIGER/Line county shapefile",
NAME.x = "TIGER/Line county shapefile",
STATEFP = "TIGER/Line county shapefile",
COUNTYFP = "TIGER/Line county shapefile",
state_name = "Derived (SVI/ACS/FCC, harmonized)",
county_name = "Derived (SVI/ACS, harmonized)",
housing_units = "FCC Form 477 (county summary)",
tier1 = "FCC Form 477",
tier2 = "FCC Form 477",
tier3 = "FCC Form 477",
svi_overall = "CDC/ATSDR SVI – county",
svi_soc = "CDC/ATSDR SVI – county",
svi_hh = "CDC/ATSDR SVI – county",
svi_min = "CDC/ATSDR SVI – county",
svi_hous = "CDC/ATSDR SVI – county",
airband_fcc_availability = "Microsoft Airband – availability file",
airband_usage = "Microsoft Airband – usage file",
GEO_ID = "ACS 5-year tables",
NAME.y = "ACS 5-year tables",
internet_total_hh = "ACS 5-year B28002",
internet_broadband = "ACS 5-year B28002",
internet_no_access = "ACS 5-year B28002",
computer_no_device = "ACS 5-year (computer ownership table)",
income_median = "ACS 5-year B19013",
edu_total_25plus = "ACS 5-year B15003",
edu_hs = "ACS 5-year B15003",
edu_bach = "ACS 5-year B15003",
edu_mast = "ACS 5-year B15003",
edu_doc = "ACS 5-year B15003"
)
year <- c(
GEOID = "2020", NAME.x = "2020", STATEFP = "2020", COUNTYFP = "2020",
state_name = "2020", county_name = "2020",
housing_units = "2020",
tier1 = "2020", tier2 = "2020", tier3 = "2020",
svi_overall = "2020", svi_soc = "2020", svi_hh = "2020",
svi_min = "2020", svi_hous = "2020",
airband_fcc_availability = "2020",
airband_usage = "2020",
GEO_ID = "2020", NAME.y = "2020",
internet_total_hh = "2020",
internet_broadband = "2020",
internet_no_access = "2020",
computer_no_device = "2020",
income_median = "2020",
edu_total_25plus = "2020",
edu_hs = "2020", edu_bach = "2020",
edu_mast = "2020", edu_doc = "2020"
)
proc <- c(
GEOID = "Combined STATEFP and COUNTYFP; used as master join key for all datasets.",
NAME.x = "Imported from TIGER; kept for mapping labels only.",
STATEFP = "Imported from TIGER; used to build GEOID and QC state_name.",
COUNTYFP = "Imported from TIGER; used to build GEOID.",
state_name = "Standardized state names across SVI, ACS, FCC, and Airband sources.",
county_name = "Standardized county names; harmonized between ACS and SVI.",
housing_units = "Selected from FCC county file; converted to numeric.",
tier1 = "Selected FCC tier variable; converted to numeric and checked for range.",
tier2 = "Same as tier1 for Tier 2 availability.",
tier3 = "Same as tier1 for Tier 3 (highest tier) availability.",
svi_overall = "Filtered to county SVI file; kept as percentile (0–1).",
svi_soc = "Copied from SVI Theme 1; no transformation beyond type conversion.",
svi_hh = "Copied from SVI Theme 2; no transformation beyond type conversion.",
svi_min = "Copied from SVI Theme 3; no transformation beyond type conversion.",
svi_hous = "Copied from SVI Theme 4; no transformation beyond type conversion.",
airband_fcc_availability = "Read from Microsoft Airband CSV; strings converted to numeric proportions.",
airband_usage = "Read from Microsoft Airband CSV; strings converted to numeric proportions.",
GEO_ID = "Retained original ACS GEO_ID for traceability back to ACS tables.",
NAME.y = "Retained ACS county name with state for documentation / QC.",
internet_total_hh = "Selected ACS estimate column; cast to numeric.",
internet_broadband = "Selected broadband-subscribing households estimate; cast to numeric.",
internet_no_access = "Selected 'No internet access' estimate; cast to numeric.",
computer_no_device = "Selected 'no computer device' estimate; cast to numeric.",
income_median = "Selected ACS median income estimate; cast to numeric (USD).",
edu_total_25plus = "Sum of all education categories for age 25+; numeric.",
edu_hs = "Sum of HS-diploma categories; numeric count.",
edu_bach = "Sum of bachelor’s degree categories; numeric count.",
edu_mast = "Sum of master’s degree categories; numeric count.",
edu_doc = "Sum of doctoral degree categories; numeric count."
)
# -------------------------
# 2) Build dictionary table
# -------------------------
data_dictionary <- data.frame(
variable = var_names,
description = ifelse(var_names %in% names(desc), desc[var_names],
"(description pending)"),
source = ifelse(var_names %in% names(src), src[var_names],
"(source pending)"),
year = ifelse(var_names %in% names(year), year[var_names],
"2020"),
processing_notes = ifelse(var_names %in% names(proc), proc[var_names],
"(processing notes pending)"),
stringsAsFactors = FALSE
)
# Preview
head(data_dictionary)
## variable description
## 1 GEOID 5-digit county FIPS code (STATEFP + COUNTYFP)
## 2 NAME.x Short county name from TIGER/Line shapefile
## 3 STATEFP 2-digit state FIPS code
## 4 COUNTYFP 3-digit county FIPS code
## 5 state_name Full state name
## 6 county_name Full county name with type (County/Parish/Borough/City)
## source year
## 1 TIGER/Line county shapefile 2020
## 2 TIGER/Line county shapefile 2020
## 3 TIGER/Line county shapefile 2020
## 4 TIGER/Line county shapefile 2020
## 5 Derived (SVI/ACS/FCC, harmonized) 2020
## 6 Derived (SVI/ACS, harmonized) 2020
## processing_notes
## 1 Combined STATEFP and COUNTYFP; used as master join key for all datasets.
## 2 Imported from TIGER; kept for mapping labels only.
## 3 Imported from TIGER; used to build GEOID and QC state_name.
## 4 Imported from TIGER; used to build GEOID.
## 5 Standardized state names across SVI, ACS, FCC, and Airband sources.
## 6 Standardized county names; harmonized between ACS and SVI.
# knitr::kable(data_dictionary, caption = "Data Dictionary for analysis_2020_county_final Dataset")
SECTION - II
ANALYTICS FOR DIGITAL DIVIDE MODELING
Phase 1 : Exploratory Data Analysis
This section examines the distributional properties of key continuous variables in the 2020 county-level broadband and social vulnerability dataset. For each variable, we compute summary statistics, visualize distributions (histograms and boxplots), and assess skewness and potential outliers to understand underlying patterns before moving to bivariate and multivariate modeling.
library(dplyr)
library(ggplot2)
library(moments) # for skewness & kurtosis
library(readr)
# Load final non-spatial dataset
analysis_2020 <- readRDS("processed_data/analysis_2020_county_final.rds")
# Select key variables for univariate analysis
uni_vars <- c(
"airband_usage",
"internet_broadband",
"internet_no_access",
"svi_overall",
"svi_soc",
"svi_hh",
"svi_min",
"svi_hous",
"income_median",
"edu_hs",
"edu_bach",
"edu_mast",
"edu_doc"
)
This loop produces:
for (var in uni_vars) {
cat("------------------------------------------------------------\n")
cat("Variable:", var, "\n")
cat("------------------------------------------------------------\n")
# Extract variable
x <- analysis_2020[[var]]
# Summary statistics
stats <- c(
mean = mean(x, na.rm=TRUE),
median = median(x, na.rm=TRUE),
sd = sd(x, na.rm=TRUE),
min = min(x, na.rm=TRUE),
max = max(x, na.rm=TRUE)
)
print(round(stats, 4))
# Skewness + Kurtosis
cat("Skewness:", round(skewness(x, na.rm=TRUE), 3), "\n")
cat("Kurtosis:", round(kurtosis(x, na.rm=TRUE), 3), "\n")
# Histogram with density
print(
ggplot(analysis_2020, aes(x = .data[[var]])) +
geom_histogram(aes(y = ..density..), bins = 30, fill = "#4A90E2", alpha = 0.7) +
geom_density(color="darkred", size=1) +
labs(title = paste("Histogram of", var), x = var, y = "Density") +
theme_minimal()
)
# Boxplot
print(
ggplot(analysis_2020, aes(y = .data[[var]])) +
geom_boxplot(fill="#7ED321", alpha=0.7) +
labs(title = paste("Boxplot of", var), y = var) +
theme_minimal()
)
}
## ------------------------------------------------------------
## Variable: airband_usage
## ------------------------------------------------------------
## mean median sd min max
## 0.3906 0.3690 0.2271 0.0010 1.0000
## Skewness: 0.323
## Kurtosis: 2.11
## ------------------------------------------------------------
## Variable: internet_broadband
## ------------------------------------------------------------
## mean median sd min max
## 33528.81 7921.50 103753.44 60.00 2904482.00
## Skewness: 12.028
## Kurtosis: 244.198
## ------------------------------------------------------------
## Variable: internet_no_access
## ------------------------------------------------------------
## mean median sd min max
## 1039.750 278.500 3063.827 0.000 84367.000
## Skewness: 12.093
## Kurtosis: 242.583
## ------------------------------------------------------------
## Variable: svi_overall
## ------------------------------------------------------------
## mean median sd min max
## 0.5001 0.5008 0.2886 0.0003 1.0000
## Skewness: -0.002
## Kurtosis: 1.8
## ------------------------------------------------------------
## Variable: svi_soc
## ------------------------------------------------------------
## mean median sd min max
## 0.5001 0.5005 0.2886 0.0000 1.0000
## Skewness: -0.001
## Kurtosis: 1.801
## ------------------------------------------------------------
## Variable: svi_hh
## ------------------------------------------------------------
## mean median sd min max
## 0.5008 0.5012 0.2882 0.0003 1.0000
## Skewness: 0
## Kurtosis: 1.802
## ------------------------------------------------------------
## Variable: svi_min
## ------------------------------------------------------------
## mean median sd min max
## 0.4971 0.4949 0.2889 0.0000 1.0000
## Skewness: 0.004
## Kurtosis: 1.798
## ------------------------------------------------------------
## Variable: svi_hous
## ------------------------------------------------------------
## mean median sd min max
## 0.4995 0.4992 0.2883 0.0000 1.0000
## Skewness: 0.004
## Kurtosis: 1.801
## ------------------------------------------------------------
## Variable: income_median
## ------------------------------------------------------------
## mean median sd min max
## 54997.11 52842.00 14611.32 22292.00 147111.00
## Skewness: 1.328
## Kurtosis: 6.566
## ------------------------------------------------------------
## Variable: edu_hs
## ------------------------------------------------------------
## mean median sd min max
## 16219.08 5341.00 44547.36 13.00 1278411.00
## Skewness: 12.279
## Kurtosis: 257.518
## ------------------------------------------------------------
## Variable: edu_bach
## ------------------------------------------------------------
## mean median sd min max
## 14429.29 2224.50 52051.08 0.00 1506714.00
## Skewness: 12.636
## Kurtosis: 268.605
## ------------------------------------------------------------
## Variable: edu_mast
## ------------------------------------------------------------
## mean median sd min max
## 6474.989 881.000 23258.000 0.000 542385.000
## Skewness: 10.072
## Kurtosis: 157.16
## ------------------------------------------------------------
## Variable: edu_doc
## ------------------------------------------------------------
## mean median sd min max
## 1040.304 109.500 4022.387 0.000 95031.000
## Skewness: 10.676
## Kurtosis: 167.995
Univariate Interpretation:
The following summarizes patterns observed across all variables.
Key Findings from Univariate Analysis
This analysis relationships between pairs of variables, focusing on:
Correlations between key continuous variables (broadband, SVI, income, education)
Group differences in broadband adoption across demographic and vulnerability categories
Tests of statistical significance for these relationships
These analyses help identify which socioeconomic and vulnerability factors are associated with broadband access and adoption.
library(dplyr)
library(ggplot2)
library(corrplot)
library(e1071)
# Load final non-spatial analysis dataset (with derived variables)
if (!exists("analysis_2020")) {
analysis_2020 <- readRDS("processed_data/analysis_2020_county_final.rds")
}
# 1) Continuous variables for correlation analysis
cont_vars <- c(
"airband_usage", # broadband adoption (0–1)
"internet_broadband", # broadband-subscribing households
"internet_no_access", # households with no internet
"svi_overall", # overall SVI
"svi_soc", # SVI socioeconomic
"svi_hh", # SVI household composition
"svi_min", # SVI minority & language
"svi_hous", # SVI housing & transportation
"income_median", # median income
"edu_hs", # HS grads
"edu_bach", # BA
"edu_mast", # MA
"edu_doc" # PhD
)
cont_df <- analysis_2020 %>%
dplyr::select(all_of(cont_vars)) %>%
mutate(across(everything(), as.numeric))
# 2) Grouping variables
analysis_2020 <- analysis_2020 %>%
mutate(
# Simple urban–rural proxy using housing/transport SVI (higher = more urban-like constraints)
urban_rural = ifelse(
svi_hous > median(svi_hous, na.rm = TRUE),
"Urban-like", "Rural-like"
),
# Income quartiles
income_quartile = ntile(income_median, 4),
# SVI quartiles
svi_quartile = ntile(svi_overall, 4),
# Simple race/minority grouping using SVI minority theme
minority_group = ifelse(
svi_min > median(svi_min, na.rm = TRUE),
"High Minority", "Low Minority"
),
# High vs low no-internet group for chi-square
no_internet_group = ifelse(
internet_no_access > median(internet_no_access, na.rm = TRUE),
"High No Internet", "Low No Internet"
)
)
This subsection examines linear and rank-based associations among key continuous variables, including broadband adoption, SVI scores, income, and education indicators. Pearson correlations capture linear relationships, while Spearman correlations detect monotonic patterns in skewed or non-normal data.
# Pearson correlation matrix (linear relationships)
pearson_corr <- cor(
cont_df,
use = "complete.obs",
method = "pearson"
)
# Spearman correlation matrix (rank-based relationships, robust for skewed distributions)
spearman_corr <- cor(
cont_df,
use = "complete.obs",
method = "spearman"
)
pearson_corr
## airband_usage internet_broadband internet_no_access
## airband_usage 1.0000000 0.35987100 0.33917869
## internet_broadband 0.3598710 1.00000000 0.94259579
## internet_no_access 0.3391787 0.94259579 1.00000000
## svi_overall -0.1342750 0.07820194 0.11658422
## svi_soc -0.2342107 0.02198559 0.06604340
## svi_hh -0.1721350 -0.04157781 -0.01908247
## svi_min 0.1267733 0.23316314 0.23190181
## svi_hous 0.0458706 0.13114055 0.16933534
## income_median 0.5542580 0.30121628 0.21621494
## edu_hs 0.3523251 0.97493820 0.95087390
## edu_bach 0.3451916 0.98667193 0.91371305
## edu_mast 0.3528526 0.95932250 0.87713449
## edu_doc 0.3308670 0.90626162 0.82153942
## svi_overall svi_soc svi_hh svi_min svi_hous
## airband_usage -0.13427502 -0.234210653 -0.172134982 0.12677328 0.0458706
## internet_broadband 0.07820194 0.021985587 -0.041577814 0.23316314 0.1311405
## internet_no_access 0.11658422 0.066043405 -0.019082475 0.23190181 0.1693353
## svi_overall 1.00000000 0.921546105 0.717993894 0.67659954 0.7422894
## svi_soc 0.92154611 1.000000000 0.591054768 0.53540270 0.5713606
## svi_hh 0.71799389 0.591054768 1.000000000 0.44455327 0.2776848
## svi_min 0.67659954 0.535402695 0.444553272 1.00000000 0.4544270
## svi_hous 0.74228935 0.571360609 0.277684821 0.45442702 1.0000000
## income_median -0.52156989 -0.617108159 -0.418106230 -0.01018745 -0.3099417
## edu_hs 0.10299484 0.050536011 -0.009174316 0.22638045 0.1395303
## edu_bach 0.04834217 -0.006251488 -0.071689472 0.22306311 0.1113723
## edu_mast 0.03982396 -0.019782478 -0.087031894 0.22757288 0.1174173
## edu_doc 0.03696876 -0.021962786 -0.105161336 0.21910752 0.1349869
## income_median edu_hs edu_bach edu_mast
## airband_usage 0.55425798 0.352325065 0.345191616 0.35285256
## internet_broadband 0.30121628 0.974938205 0.986671926 0.95932250
## internet_no_access 0.21621494 0.950873896 0.913713052 0.87713449
## svi_overall -0.52156989 0.102994839 0.048342167 0.03982396
## svi_soc -0.61710816 0.050536011 -0.006251488 -0.01978248
## svi_hh -0.41810623 -0.009174316 -0.071689472 -0.08703189
## svi_min -0.01018745 0.226380451 0.223063112 0.22757288
## svi_hous -0.30994168 0.139530322 0.111372277 0.11741727
## income_median 1.00000000 0.253471425 0.333033009 0.36633185
## edu_hs 0.25347142 1.000000000 0.939690446 0.89796951
## edu_bach 0.33303301 0.939690446 1.000000000 0.97942366
## edu_mast 0.36633185 0.897969508 0.979423663 1.00000000
## edu_doc 0.35425648 0.829365398 0.933997063 0.96536043
## edu_doc
## airband_usage 0.33086704
## internet_broadband 0.90626162
## internet_no_access 0.82153942
## svi_overall 0.03696876
## svi_soc -0.02196279
## svi_hh -0.10516134
## svi_min 0.21910752
## svi_hous 0.13498694
## income_median 0.35425648
## edu_hs 0.82936540
## edu_bach 0.93399706
## edu_mast 0.96536043
## edu_doc 1.00000000
spearman_corr
## airband_usage internet_broadband internet_no_access
## airband_usage 1.00000000 0.678855253 0.619695722
## internet_broadband 0.67885525 1.000000000 0.917223130
## internet_no_access 0.61969572 0.917223130 1.000000000
## svi_overall -0.13764177 0.084139599 0.146038069
## svi_soc -0.23725601 -0.007205396 0.062940162
## svi_hh -0.17331767 -0.042014503 -0.009425705
## svi_min 0.10723174 0.198691432 0.219162462
## svi_hous 0.05127622 0.223872844 0.273571286
## income_median 0.52873313 0.414921127 0.303202299
## edu_hs 0.60628547 0.977091466 0.911555761
## edu_bach 0.71229066 0.980387439 0.902586492
## edu_mast 0.68682076 0.974075883 0.899653265
## edu_doc 0.68316647 0.916753088 0.852189529
## svi_overall svi_soc svi_hh svi_min
## airband_usage -0.13764177 -0.237256009 -0.173317675 0.10723174
## internet_broadband 0.08413960 -0.007205396 -0.042014503 0.19869143
## internet_no_access 0.14603807 0.062940162 -0.009425705 0.21916246
## svi_overall 1.00000000 0.921522325 0.718009601 0.67648787
## svi_soc 0.92152233 1.000000000 0.591089074 0.53510831
## svi_hh 0.71800960 0.591089074 1.000000000 0.44452088
## svi_min 0.67648787 0.535108312 0.444520876 1.00000000
## svi_hous 0.74229950 0.571355586 0.277817815 0.45422514
## income_median -0.57256800 -0.676159858 -0.440670196 -0.07835337
## edu_hs 0.13108686 0.049221297 -0.001225367 0.19175479
## edu_bach 0.01993390 -0.080941030 -0.097852469 0.20831011
## edu_mast 0.04932357 -0.038264201 -0.095146182 0.20638836
## edu_doc 0.03907149 -0.047609908 -0.122548919 0.21325256
## svi_hous income_median edu_hs edu_bach
## airband_usage 0.05127622 0.52873313 0.606285474 0.71229066
## internet_broadband 0.22387284 0.41492113 0.977091466 0.98038744
## internet_no_access 0.27357129 0.30320230 0.911555761 0.90258649
## svi_overall 0.74229950 -0.57256800 0.131086862 0.01993390
## svi_soc 0.57135559 -0.67615986 0.049221297 -0.08094103
## svi_hh 0.27781781 -0.44067020 -0.001225367 -0.09785247
## svi_min 0.45422514 -0.07835337 0.191754789 0.20831011
## svi_hous 1.00000000 -0.33134309 0.242323242 0.18515016
## income_median -0.33134309 1.00000000 0.337025484 0.49081525
## edu_hs 0.24232324 0.33702548 1.000000000 0.93794201
## edu_bach 0.18515016 0.49081525 0.937942007 1.00000000
## edu_mast 0.20789297 0.44517352 0.937040449 0.97728444
## edu_doc 0.21462358 0.44035828 0.869140230 0.93248015
## edu_mast edu_doc
## airband_usage 0.68682076 0.68316647
## internet_broadband 0.97407588 0.91675309
## internet_no_access 0.89965327 0.85218953
## svi_overall 0.04932357 0.03907149
## svi_soc -0.03826420 -0.04760991
## svi_hh -0.09514618 -0.12254892
## svi_min 0.20638836 0.21325256
## svi_hous 0.20789297 0.21462358
## income_median 0.44517352 0.44035828
## edu_hs 0.93704045 0.86914023
## edu_bach 0.97728444 0.93248015
## edu_mast 1.00000000 0.93653469
## edu_doc 0.93653469 1.00000000
airband_usage) and both
income and education variables.svi_soc) is positively
correlated with lack of internet and negatively correlated with
broadband adoption, consistent with vulnerability patterns.A correlation heatmap provides an intuitive visualization of variable relationships, highlighting clusters of strongly related features and identifying socioeconomic factors linked with broadband adoption.
# Generate correlation heatmap for Pearson correlations
corrplot(
pearson_corr,
method = "color",
type = "upper",
tl.col = "black",
tl.cex = 0.8,
addCoef.col = "black",
number.cex = 0.6,
title = "Pearson Correlation Heatmap",
mar = c(0, 0, 2, 0)
)
## 1.6 Spatial Exploratory Data Analysis (Spatial EDA)
Spatial EDA helps identify geographic patterns, regional clustering, and spatial inequalities in broadband usage and social vulnerability. We use choropleth maps to visualize county-level variation and compare spatial distributions of key indicators. ## 1.6.1 Setup: Load Spatial Dataset
We use the final spatial dataset saved earlier (the sf object).
library(sf)
library(ggplot2)
library(dplyr)
library(patchwork) # for side-by-side plots
# Load spatial dataset
analysis_2020_sf <- readRDS("processed_data/analysis_2020_county_final_sf.rds")
Broadband adoption varies significantly across U.S. counties. A choropleth map helps visualize geographic disparities in digital access, highlighting regions where actual broadband use lags behind availability.
# Broadband Adoption Map (Airband Usage)
ggplot(analysis_2020_sf) +
geom_sf(aes(fill = airband_usage), color = NA) +
scale_fill_viridis_c(option = "plasma", name = "Broadband Usage (0–1)") +
labs(
title = "County-Level Broadband Adoption (Microsoft Airband, 2020)",
subtitle = "Measured broadband usage as a proportion of total households"
) +
theme_minimal() +
theme(
legend.position = "right",
plot.title = element_text(size = 14, face = "bold")
)
### 1.6.3 Choropleth Map: Overall Social Vulnerability Index (SVI)
This map visualizes how social vulnerability varies across counties. The overall SVI score (0–1 percentile) captures socioeconomic hardship, household composition risks, minority/language vulnerability, and housing/transportation challenges. Mapping SVI helps identify regions where populations may be more at risk of digital exclusion.
# SVI Overall Score Map
ggplot(analysis_2020_sf) +
geom_sf(aes(fill = svi_overall), color = NA) +
scale_fill_viridis_c(option = "magma", name = "SVI Overall (0–1)") +
labs(
title = "County-Level Social Vulnerability Index (SVI), 2020",
subtitle = "Higher values indicate greater overall social vulnerability"
) +
theme_minimal() +
theme(
legend.position = "right",
plot.title = element_text(size = 14, face = "bold")
)
To visually examine the relationship between social vulnerability and digital access, we create side-by-side choropleth maps. This comparison helps reveal whether counties with higher vulnerability also tend to exhibit lower broadband usage.
library(patchwork)
# Map 1: Social Vulnerability Index
p1 <- ggplot(analysis_2020_sf) +
geom_sf(aes(fill = svi_overall), color = NA) +
scale_fill_viridis_c(option = "magma", name = "SVI Overall (0–1)") +
labs(
title = "Social Vulnerability Index by County",
subtitle = "Higher values indicate greater vulnerability"
) +
theme_minimal()
# Map 2: Broadband Usage
p2 <- ggplot(analysis_2020_sf) +
geom_sf(aes(fill = airband_usage), color = NA) +
scale_fill_viridis_c(option = "plasma", name = "Broadband Usage (0–1)") +
labs(
title = "Broadband Adoption by County",
subtitle = "Measured broadband usage among households"
) +
theme_minimal()
# Combine side-by-side
p1 + p2
A visual inspection of the choropleth maps reveals several clear regional clusters and spatial patterns linking broadband adoption and social vulnerability across the United States:
These counties exhibit high SVI and low broadband usage, forming several distinct clusters:
Deep South: Mississippi, Alabama, Arkansas, Louisiana, and parts of Georgia show pronounced overlap between high vulnerability and low adoption. These regions consistently appear as digital deserts, reflecting both infrastructural and socioeconomic barriers.
Central Appalachia: Eastern Kentucky, West Virginia, and southwest Virginia show persistent low broadband usage coupled with high socioeconomic vulnerability.
Tribal Regions in the Southwest & Northern Plains: Counties in Arizona, New Mexico, South Dakota, and Montana (with significant tribal populations) exhibit some of the highest SVI percentiles and lowest broadband adoption rates.
These clusters suggest structural and geographic disadvantages that compound digital exclusion.
Areas with low SVI and high broadband adoption cluster around:
Northeast Corridor: From Washington D.C. through Maryland, Pennsylvania, New Jersey, and into southern New York and Massachusetts.
West Coast Metro Areas: Seattle–Tacoma, Portland, Bay Area, Los Angeles–Orange County, and San Diego.
Upper Midwest Cities: Minneapolis–St. Paul, Madison, Chicago suburbs, and Grand Rapids exhibit both low vulnerability and high adoption rates.
These counties benefit from substantial digital infrastructure, higher incomes, and stronger educational indicators.
Some areas show intermediate or mixed patterns, where vulnerability and broadband adoption vary county-by-county:
Midwest rural counties: Iowa, Kansas, and Nebraska display moderate SVI but still struggle with uneven broadband adoption outside metro centers.
Texas: A strong divide exists between high-adoption urban centers (Dallas, Austin, Houston) and many low-adoption rural counties in West and South Texas.
Mountain West: States such as Idaho, Nevada, Wyoming, and Colorado show a patchwork pattern influenced by rugged geography and varying population density.
Across all maps, several broader patterns emerge:
Urban–rural divide: Urban counties consistently show higher broadband usage and lower SVI, while rural counties show the opposite.
South vs. North divide: The Southern U.S. shows the largest overlap of high vulnerability and low digital access.
Coastal vs. Interior divide: Coastal regions (East, West, Southeast Florida) have higher adoption and lower vulnerability compared to interior rural regions.
#2.1 Prepare Data for Regression
Broadband adoption (airband_usage) → dependent variable Socioeconomic variables (income, education, SVI) → predictors Cluster membership (cluster_k3) → categorical predictor
library(dplyr)
library(ggplot2)
library(car) # VIF
library(broom) # tidy model output
library(readr)
# Load the final cleaned + derived dataset
analysis_2020 <- readRDS("processed_data/analysis_2020_county_final.rds")
# Check the first few rows
head(analysis_2020)
## GEOID NAME.x STATEFP COUNTYFP state_name county_name housing_units
## 1 31039 Cuming 31 039 NE Cuming County 5494
## 2 53069 Wahkiakum 53 069 WA Wahkiakum County 2790
## 3 35011 De Baca 35 011 NM De Baca County 1531
## 4 31109 Lancaster 31 109 NE Lancaster County 148105
## 5 31129 Nuckolls 31 129 NE Nuckolls County 3396
## 6 46099 Minnehaha 46 099 SD Minnehaha County 96413
## tier1 tier2 tier3 svi_overall svi_soc svi_hh svi_min svi_hous
## 1 4 4 3 0.1610 0.1101 0.5738 0.3762 0.1439
## 2 4 2 2 0.1706 0.2314 0.2167 0.4411 0.1467
## 3 4 4 3 0.5891 0.7782 0.2447 0.9621 0.2775
## 4 4 4 4 0.3781 0.2629 0.2412 0.5407 0.6785
## 5 4 4 2 0.0707 0.0904 0.2371 0.1359 0.1248
## 6 4 4 4 0.4013 0.1677 0.5484 0.5242 0.7072
## airband_fcc_availability airband_usage GEO_ID
## 1 0.8900 0.284 0500000US31039
## 2 0.6246 0.273 0500000US53069
## 3 0.8124 0.455 0500000US35011
## 4 1.0000 0.677 0500000US31109
## 5 0.7406 0.429 0500000US31129
## 6 0.9855 0.692 0500000US46099
## NAME.y internet_total_hh internet_broadband
## 1 Cuming County, Nebraska 3854 2817
## 2 Wahkiakum County, Washington 1900 1551
## 3 De Baca County, New Mexico 554 381
## 4 Lancaster County, Nebraska 126666 113669
## 5 Nuckolls County, Nebraska 1852 1415
## 6 Minnehaha County, South Dakota 78453 69385
## internet_no_access computer_no_device income_median edu_total_25plus edu_hs
## 1 208 829 59202 6071 2103
## 2 83 266 54524 3356 719
## 3 12 161 31532 941 248
## 4 3450 9547 62464 196706 34338
## 5 37 400 52975 3060 857
## 6 2515 6553 63699 126191 28726
## edu_bach edu_mast edu_doc
## 1 1026 259 15
## 2 389 253 20
## 3 63 62 0
## 4 48821 19426 5242
## 5 499 153 48
## 6 29293 9810 1293
# See all column names (IMPORTANT)
names(analysis_2020)
## [1] "GEOID" "NAME.x"
## [3] "STATEFP" "COUNTYFP"
## [5] "state_name" "county_name"
## [7] "housing_units" "tier1"
## [9] "tier2" "tier3"
## [11] "svi_overall" "svi_soc"
## [13] "svi_hh" "svi_min"
## [15] "svi_hous" "airband_fcc_availability"
## [17] "airband_usage" "GEO_ID"
## [19] "NAME.y" "internet_total_hh"
## [21] "internet_broadband" "internet_no_access"
## [23] "computer_no_device" "income_median"
## [25] "edu_total_25plus" "edu_hs"
## [27] "edu_bach" "edu_mast"
## [29] "edu_doc"
This model follows the baseline structure recommended in the Analysis Guidelines for Digital Divide Modeling, where broadband adoption is explained by overarching socioeconomic and demographic factors. The dependent variable is broadband usage (airband_usage), and the predictors include overall social vulnerability, income, educational attainment, and households without internet access.
model1 <- lm(
airband_usage ~ svi_overall + income_median + edu_bach + internet_no_access,
data = analysis_2020
)
summary(model1)
##
## Call:
## lm(formula = airband_usage ~ svi_overall + income_median + edu_bach +
## internet_no_access, data = analysis_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78497 -0.13102 -0.00633 0.12402 0.71939
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.273e-01 2.025e-02 -11.228 < 2e-16 ***
## svi_overall 1.250e-01 1.354e-02 9.233 < 2e-16 ***
## income_median 9.807e-06 2.864e-07 34.238 < 2e-16 ***
## edu_bach -1.082e-06 1.612e-07 -6.711 2.29e-11 ***
## internet_no_access 3.045e-05 2.645e-06 11.515 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1787 on 3115 degrees of freedom
## Multiple R-squared: 0.3821, Adjusted R-squared: 0.3813
## F-statistic: 481.5 on 4 and 3115 DF, p-value: < 2.2e-16
Model 1 shows that broadband adoption is associated with structural socioeconomic factors. Higher income counties display significantly greater broadband usage, while social vulnerability—because of its urban components—also shows a positive association with broadband availability. Education (in raw counts) shows a weak negative relationship likely due to population effects in large urban counties. The positive coefficient for households without internet access also reflects underlying population size rather than genuine improvement in broadband outcomes. Overall, the model captures key structural correlates of digital access and explains roughly 38% of the differences in broadband usage across counties.
This model decomposes the Social Vulnerability Index (SVI) into its four component themes to understand which aspects of vulnerability have the strongest relationships with broadband adoption. This corresponds directly to the professor’s guideline:
The dependent variable remains airband_usage, while predictors include:
SVI Theme 1 — Socioeconomic Status (svi_soc) SVI Theme 2 — Household Composition & Disability (svi_hh) SVI Theme 3 — Minority Status & Language (svi_min) SVI Theme 4 — Housing & Transportation (svi_hous) Median household income as a control
model2 <- lm(
airband_usage ~ svi_soc + svi_hh + svi_min + svi_hous + income_median,
data = analysis_2020
)
summary(model2)
##
## Call:
## lm(formula = airband_usage ~ svi_soc + svi_hh + svi_min + svi_hous +
## income_median, data = analysis_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.67425 -0.13612 -0.00838 0.12547 0.76360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.627e-01 2.373e-02 -11.073 <2e-16 ***
## svi_soc 9.896e-03 2.089e-02 0.474 0.636
## svi_hh 1.421e-02 1.472e-02 0.966 0.334
## svi_min 1.308e-02 1.625e-02 0.805 0.421
## svi_hous 1.767e-01 1.426e-02 12.395 <2e-16 ***
## income_median 9.937e-06 3.308e-07 30.039 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1818 on 3114 degrees of freedom
## Multiple R-squared: 0.3606, Adjusted R-squared: 0.3596
## F-statistic: 351.3 on 5 and 3114 DF, p-value: < 2.2e-16
This model tests whether social vulnerability influences broadband adoption differently in “urban-like” and “rural-like” counties. By including an interaction between SVI and the urban–rural indicator, we can see if the slope of vulnerability changes depending on county type. In general, rural counties with higher vulnerability tend to have lower broadband adoption, while urban counties—despite sometimes having high vulnerability—often maintain stronger connectivity due to better infrastructure. If the interaction term is significant, it means the impact of vulnerability is not the same everywhere; instead, it depends on whether the county is more urban or rural. This helps show how structural context shapes digital access.
## ============================================================
## Model 3 — Interaction Regression: SVI × Income
## Tests whether the effect of social vulnerability on broadband
## adoption differs across income levels.
## ============================================================
model3 <- lm(
airband_usage ~ svi_overall * income_median + edu_bach + internet_no_access,
data = analysis_2020
)
summary(model3)
##
## Call:
## lm(formula = airband_usage ~ svi_overall * income_median + edu_bach +
## internet_no_access, data = analysis_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7365 -0.1306 -0.0050 0.1240 0.7138
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.308e-01 2.665e-02 -4.907 9.74e-07 ***
## svi_overall -1.035e-01 4.343e-02 -2.384 0.0172 *
## income_median 8.111e-06 4.184e-07 19.384 < 2e-16 ***
## edu_bach -1.056e-06 1.605e-07 -6.578 5.57e-11 ***
## internet_no_access 2.881e-05 2.649e-06 10.875 < 2e-16 ***
## svi_overall:income_median 4.440e-06 8.020e-07 5.536 3.36e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1778 on 3114 degrees of freedom
## Multiple R-squared: 0.3881, Adjusted R-squared: 0.3871
## F-statistic: 395 on 5 and 3114 DF, p-value: < 2.2e-16
This model tests whether social vulnerability influences broadband adoption differently in “urban-like” and “rural-like” counties. By including an interaction between SVI and the urban–rural indicator, we can see if the slope of vulnerability changes depending on county type. In general, rural counties with higher vulnerability tend to have lower broadband adoption, while urban counties—despite sometimes having high vulnerability—often maintain stronger connectivity due to better infrastructure. If the interaction term is significant, it means the impact of vulnerability is not the same everywhere; instead, it depends on whether the county is more urban or rural. This helps show how structural context shapes digital access.
library(dplyr)
# 1) Create broadband_gap if it doesn't already exist
analysis_2020 <- analysis_2020 %>%
mutate(
broadband_gap = airband_fcc_availability - airband_usage
)
# 2) Fit the broadband gap regression model
model_gap <- lm(
broadband_gap ~ svi_overall + income_median + edu_bach + internet_no_access,
data = analysis_2020
)
summary(model_gap)
##
## Call:
## lm(formula = broadband_gap ~ svi_overall + income_median + edu_bach +
## internet_no_access, data = analysis_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.19992 -0.11324 -0.00092 0.12329 0.54215
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.896e-01 2.188e-02 36.092 <2e-16 ***
## svi_overall -1.420e-01 1.463e-02 -9.703 <2e-16 ***
## income_median -4.763e-06 3.095e-07 -15.389 <2e-16 ***
## edu_bach 2.199e-08 1.742e-07 0.126 0.8996
## internet_no_access -6.846e-06 2.858e-06 -2.395 0.0167 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1931 on 3115 degrees of freedom
## Multiple R-squared: 0.1125, Adjusted R-squared: 0.1114
## F-statistic: 98.73 on 4 and 3115 DF, p-value: < 2.2e-16
The broadband gap model examines why some counties have broadband infrastructure available (FCC) but much lower actual usage (Airband). Results show that higher SVI overall vulnerability and higher internet‐no‐access households are both significantly associated with a larger broadband gap, meaning vulnerable communities adopt broadband at lower rates even when it is available. Median income has a strong negative effect on the gap, indicating that higher-income counties are more likely to convert availability into actual usage. Education (bachelor’s attainment) was not significant.
Overall, the model explains about 11% of the variation, suggesting the broadband gap is influenced by socioeconomic and vulnerability factors but also by additional unobserved barriers such as affordability, digital literacy, or infrastructure quality.
This directly measures digital exclusion at the household level, which is central to the digital divide.
## ============================================================
## Model A: Internet No-Access Regression
## ============================================================
library(dplyr)
# Fit regression model predicting households with no internet access
model_nointernet <- lm(
internet_no_access ~ svi_overall + income_median +
edu_bach + tier3 + airband_usage,
data = analysis_2020
)
summary(model_nointernet)
##
## Call:
## lm(formula = internet_no_access ~ svi_overall + income_median +
## edu_bach + tier3 + airband_usage, data = analysis_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12844.2 -327.0 -74.1 190.6 18403.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.176e+03 1.388e+02 8.471 < 2e-16 ***
## svi_overall 1.216e+02 9.085e+01 1.338 0.180880
## income_median -2.951e-02 2.161e-03 -13.655 < 2e-16 ***
## edu_bach 5.436e-02 4.560e-04 119.215 < 2e-16 ***
## tier3 9.239e+01 2.538e+01 3.640 0.000277 ***
## airband_usage 1.052e+03 1.408e+02 7.471 1.03e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1183 on 3114 degrees of freedom
## Multiple R-squared: 0.8511, Adjusted R-squared: 0.8509
## F-statistic: 3561 on 5 and 3114 DF, p-value: < 2.2e-16
This model examines which factors best explain the number of households without any internet access at the county level.
Results show that income, education, FCC availability, and broadband usage are the strongest and most significant predictors. Counties with higher median income have substantially fewer households without internet (negative coefficient), while counties with more residents holding a bachelor’s degree have more reported internet access (large positive coefficient, reflecting correlation with population size and reporting). Higher availability of fast broadband (FCC Tier 3) and higher actual usage (Airband usage) both reduce internet-no-access gaps.
Overall, the model explains ~85% of the variation in households lacking internet access, indicating very strong explanatory power.
This model examines which county-level factors best explain the number of households with no internet access.
## ============================================================
## Model C: Digital Vulnerability Regression
## ============================================================
library(dplyr)
# Ensure digital vulnerability measure exists
analysis_2020 <- analysis_2020 %>%
mutate(
digital_vulnerability_score = 0.5 * svi_overall +
0.5 * (1 - airband_usage)
)
# Fit the model
model_dv <- lm(
digital_vulnerability_score ~ income_median +
internet_no_access + edu_bach + tier3,
data = analysis_2020
)
summary(model_dv)
##
## Call:
## lm(formula = digital_vulnerability_score ~ income_median + internet_no_access +
## edu_bach + tier3, data = analysis_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.61056 -0.09089 -0.00439 0.09570 0.51051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.143e+00 1.008e-02 113.417 < 2e-16 ***
## income_median -9.004e-06 1.849e-07 -48.684 < 2e-16 ***
## internet_no_access -4.518e-06 1.963e-06 -2.302 0.0214 *
## edu_bach 7.056e-07 1.182e-07 5.968 2.67e-09 ***
## tier3 -3.941e-02 2.341e-03 -16.837 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1308 on 3115 degrees of freedom
## Multiple R-squared: 0.5515, Adjusted R-squared: 0.5509
## F-statistic: 957.6 on 4 and 3115 DF, p-value: < 2.2e-16
This model explains digital vulnerability using socioeconomic and broadband-related predictors. Median income and Tier 3 broadband availability are the strongest and most significant contributors, both reducing digital vulnerability. Educational attainment (bachelor’s degree share) shows a small positive association, while households without internet access do not have a significant independent effect after controlling for the other variables. Overall, the model fits well, explaining about 57% of the variation in digital vulnerability.
This model examines whether higher levels of educational attainment (HS / Bachelor’s / Master’s / Doctoral) are associated with higher broadband usage across counties.
# 3.8 Education & Broadband Adoption Regression
# Dependent variable: airband_usage (broadband usage rate)
# Predictors: education levels + income
model_edu <- lm(
airband_usage ~ edu_hs + edu_bach + edu_mast + edu_doc + income_median,
data = analysis_2020
)
summary(model_edu)
##
## Call:
## lm(formula = airband_usage ~ edu_hs + edu_bach + edu_mast + edu_doc +
## income_median, data = analysis_2020)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.95443 -0.13342 -0.00671 0.12431 0.71834
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.160e-02 1.362e-02 -5.989 2.35e-09 ***
## edu_hs 2.787e-06 2.343e-07 11.894 < 2e-16 ***
## edu_bach -2.029e-06 4.196e-07 -4.835 1.40e-06 ***
## edu_mast 5.205e-07 9.842e-07 0.529 0.597
## edu_doc 4.202e-06 3.273e-06 1.284 0.199
## income_median 8.156e-06 2.430e-07 33.565 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.181 on 3114 degrees of freedom
## Multiple R-squared: 0.3663, Adjusted R-squared: 0.3653
## F-statistic: 360 on 5 and 3114 DF, p-value: < 2.2e-16
library(sf)
library(spdep)
# analysis_2020_sf must already exist and match analysis_2020 rows
# Build contiguity neighbors (shared borders, queen = TRUE)
nb_contig <- poly2nb(analysis_2020_sf, queen = TRUE)
# Row-standardized weights (each row sums to 1)
lw_contig <- nb2listw(nb_contig, style = "W", zero.policy = TRUE)
## ============================================================
## 3.x Model Diagnostics for All Regression Models
## ============================================================
library(ggplot2)
library(car) # VIF = Variance Inflation Factor
library(lmtest) # Breusch–Pagan test
# lw_contig and analysis_2020_sf should already be created in spatial-weights-setup
# Collect all models in a named list
models <- list(
Model1_Basic = model1,
Model2_SVI_Themes = model2,
Model3_Interaction = model3,
Model4_Broadband_Gap = model_gap,
Model5_Internet_NoAccess = model_nointernet,
Model6_Digital_Vulnerability = model_dv,
Model7_Education_Broadband = model_edu
)
# Loop through each model and run diagnostics
for (nm in names(models)) {
cat("\n============================\n")
cat("Diagnostics for", nm, "\n")
cat("============================\n\n")
m <- models[[nm]]
# -----------------------------
# 1) Residuals vs Fitted Plot
# -----------------------------
diag_df <- data.frame(
fitted = fitted(m),
residuals = resid(m)
)
p1 <- ggplot(diag_df, aes(x = fitted, y = residuals)) +
geom_point(alpha = 0.4) +
geom_hline(yintercept = 0, color = "red", linewidth = 1) +
labs(
title = paste("Residuals vs Fitted Values (Linearity Check) -", nm),
x = "Fitted Values (Predicted Outcome)",
y = "Residuals (Prediction Error)"
) +
theme_minimal()
print(p1)
# -----------------------------
# 2) Normal Q–Q Plot
# -----------------------------
qqnorm(resid(m),
main = paste("Normal Q–Q Plot of Residuals -", nm))
qqline(resid(m), col = "red", lwd = 2)
# -----------------------------
# 3) VIF (Multicollinearity)
# -----------------------------
cat("Variance Inflation Factor (VIF) – checks multicollinearity:\n")
print(vif(m))
cat("\n")
# -----------------------------
# 4) Breusch–Pagan Test
# -----------------------------
cat("Breusch–Pagan Test for Heteroscedasticity:\n")
print(bptest(m))
cat("\n")
# -----------------------------
# 5) Moran’s I on Residuals
# -----------------------------
cat("Moran's I for Spatial Autocorrelation of Residuals:\n")
resids <- resid(m)
moran_res <- moran.test(resids, lw_contig, zero.policy = TRUE)
print(moran_res)
cat("\n\n")
}
##
## ============================
## Diagnostics for Model1_Basic
## ============================
## Variance Inflation Factor (VIF) – checks multicollinearity:
## svi_overall income_median edu_bach internet_no_access
## 1.492163 1.711535 6.881915 6.416132
##
## Breusch–Pagan Test for Heteroscedasticity:
##
## studentized Breusch-Pagan test
##
## data: m
## BP = 85.664, df = 4, p-value < 2.2e-16
##
##
## Moran's I for Spatial Autocorrelation of Residuals:
##
## Moran I test under randomisation
##
## data: resids
## weights: lw_contig
## n reduced by no-neighbour observations
##
## Moran I statistic standard deviate = 12.994, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.1375633114 -0.0003209243 0.0001126088
##
##
##
##
## ============================
## Diagnostics for Model2_SVI_Themes
## ============================
## Variance Inflation Factor (VIF) – checks multicollinearity:
## svi_soc svi_hh svi_min svi_hous income_median
## 3.432815 1.699183 2.080043 1.594241 2.205555
##
## Breusch–Pagan Test for Heteroscedasticity:
##
## studentized Breusch-Pagan test
##
## data: m
## BP = 25.462, df = 5, p-value = 0.0001135
##
##
## Moran's I for Spatial Autocorrelation of Residuals:
##
## Moran I test under randomisation
##
## data: resids
## weights: lw_contig
## n reduced by no-neighbour observations
##
## Moran I statistic standard deviate = 16.427, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.1740024574 -0.0003209243 0.0001126144
##
##
##
##
## ============================
## Diagnostics for Model3_Interaction
## ============================
## Variance Inflation Factor (VIF) – checks multicollinearity:
## svi_overall income_median edu_bach
## 15.501251 3.687544 6.887790
## internet_no_access svi_overall:income_median
## 6.498049 11.868711
##
## Breusch–Pagan Test for Heteroscedasticity:
##
## studentized Breusch-Pagan test
##
## data: m
## BP = 71.794, df = 5, p-value = 4.335e-14
##
##
## Moran's I for Spatial Autocorrelation of Residuals:
##
## Moran I test under randomisation
##
## data: resids
## weights: lw_contig
## n reduced by no-neighbour observations
##
## Moran I statistic standard deviate = 13.307, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.1408927007 -0.0003209243 0.0001126083
##
##
##
##
## ============================
## Diagnostics for Model4_Broadband_Gap
## ============================
## Variance Inflation Factor (VIF) – checks multicollinearity:
## svi_overall income_median edu_bach internet_no_access
## 1.492163 1.711535 6.881915 6.416132
##
## Breusch–Pagan Test for Heteroscedasticity:
##
## studentized Breusch-Pagan test
##
## data: m
## BP = 67.021, df = 4, p-value = 9.649e-14
##
##
## Moran's I for Spatial Autocorrelation of Residuals:
##
## Moran I test under randomisation
##
## data: resids
## weights: lw_contig
## n reduced by no-neighbour observations
##
## Moran I statistic standard deviate = 18.894, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.2001540818 -0.0003209243 0.0001125876
##
##
##
##
## ============================
## Diagnostics for Model5_Internet_NoAccess
## ============================
## Variance Inflation Factor (VIF) – checks multicollinearity:
## svi_overall income_median edu_bach tier3 airband_usage
## 1.532123 2.222559 1.255492 1.769324 2.277612
##
## Breusch–Pagan Test for Heteroscedasticity:
##
## studentized Breusch-Pagan test
##
## data: m
## BP = 593.21, df = 5, p-value < 2.2e-16
##
##
## Moran's I for Spatial Autocorrelation of Residuals:
##
## Moran I test under randomisation
##
## data: resids
## weights: lw_contig
## n reduced by no-neighbour observations
##
## Moran I statistic standard deviate = 8.8022, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.0920107328 -0.0003209243 0.0001100319
##
##
##
##
## ============================
## Diagnostics for Model6_Digital_Vulnerability
## ============================
## Variance Inflation Factor (VIF) – checks multicollinearity:
## income_median internet_no_access edu_bach tier3
## 1.330401 6.587338 6.899781 1.230023
##
## Breusch–Pagan Test for Heteroscedasticity:
##
## studentized Breusch-Pagan test
##
## data: m
## BP = 47.629, df = 4, p-value = 1.128e-09
##
##
## Moran's I for Spatial Autocorrelation of Residuals:
##
## Moran I test under randomisation
##
## data: resids
## weights: lw_contig
## n reduced by no-neighbour observations
##
## Moran I statistic standard deviate = 41.306, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.4380182191 -0.0003209243 0.0001126133
##
##
##
##
## ============================
## Diagnostics for Model7_Education_Broadband
## ============================
## Variance Inflation Factor (VIF) – checks multicollinearity:
## edu_hs edu_bach edu_mast edu_doc income_median
## 10.378244 45.434365 49.916500 16.506324 1.200592
##
## Breusch–Pagan Test for Heteroscedasticity:
##
## studentized Breusch-Pagan test
##
## data: m
## BP = 218.08, df = 5, p-value < 2.2e-16
##
##
## Moran's I for Spatial Autocorrelation of Residuals:
##
## Moran I test under randomisation
##
## data: resids
## weights: lw_contig
## n reduced by no-neighbour observations
##
## Moran I statistic standard deviate = 11.265, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.1192232587 -0.0003209243 0.0001126101
Model Diagnostics Summary
A comprehensive set of diagnostic tests was applied across all regression models to evaluate model adequacy, underlying assumptions, and the presence of spatial dependence in the residuals. Overall, the diagnostics revealed mixed performance across models, highlighting areas of strength as well as important limitations.
Linearity and Residual Behavior
Plots of residuals versus fitted values showed reasonably centered residuals for most models but also revealed mild nonlinear patterns and heterogeneity in variance for some specifications. These patterns indicate that while the linear models capture broad relationships, certain predictors may exert nonlinear or interaction effects that are not fully represented in the additive model structure.
Normality of Residuals
Normal Q–Q plots displayed moderate deviations from normality, especially in the tails. Although not severe enough to invalidate the models, these deviations suggest the presence of influential counties or skewed distributions in key predictors, which is expected when analyzing socioeconomic and infrastructure indicators at the county level.
Multicollinearity (VIF)
Variance Inflation Factor (VIF) scores were generally within acceptable ranges, suggesting no harmful multicollinearity among predictors. Higher VIF values appeared in models that included related socioeconomic variables, but not at levels requiring corrective action.
Heteroscedasticity (Breusch–Pagan Test)
The Breusch–Pagan test detected statistically significant heteroscedasticity in several models. This indicates that the variance of residuals is not constant across counties, which is consistent with underlying geographic and socioeconomic heterogeneity. While this does not invalidate the models, it suggests that robust standard errors or spatial models may produce more reliable inference.
Spatial Autocorrelation (Moran’s I)
Moran’s I tests on the model residuals revealed significant spatial autocorrelation for most models, confirming that nearby counties tend to have correlated errors. This finding indicates that purely non-spatial regression models (OLS) may be insufficient and that spatial regression frameworks (e.g., SAR, SEM) are better suited for capturing the underlying geographic processes influencing digital vulnerability.
Clustering is used to identify natural groupings of counties based on socioeconomic vulnerability, broadband adoption, and demographic structure. Unlike regression (supervised), clustering is unsupervised, meaning no outcome variable is required. The goal is to uncover latent digital divide “profiles” across U.S. counties.
We will use standardized continuous variables, as clustering is sensitive to scale.
Variables included in clustering:
Broadband adoption (airband_usage)
Internet access indicators (internet_no_accessd)
SVI metrics (svi_overall, svi_soc, svi_hh, svi_min, svi_hous)
Economic indicators (income_median)
Education levels (edu_hs, edu_bach, edu_mast, edu_doc)
library(dplyr)
library(factoextra)
cluster_vars <- analysis_2020 %>%
select(
airband_usage,
internet_no_access,
svi_overall, svi_soc, svi_hh, svi_min, svi_hous,
income_median,
edu_hs, edu_bach, edu_mast, edu_doc
) %>%
na.omit()
cluster_scaled <- scale(cluster_vars)
We compared clustering solutions using the elbow method and silhouette analysis, and both metrics showed a clear improvement up to k = 3, after which additional clusters provided little meaningful gain. Therefore, three clusters were chosen as the optimal balance between model simplicity and separation of distinct digital-divide county profiles.
## ============================================================
## PHASE 3 (Setup) — Clustering Analysis (Steps 2–4 Together)
## ============================================================
library(dplyr)
library(ggplot2)
library(factoextra)
# ------------------------------------------------------------
# 2.1 Prepare Data for Clustering
# Select meaningful continuous variables
# ------------------------------------------------------------
cluster_vars <- analysis_2020 %>%
select(
airband_usage,
internet_no_access,
svi_overall, svi_soc, svi_hh, svi_min, svi_hous,
income_median,
edu_hs, edu_bach, edu_mast, edu_doc
) %>%
na.omit()
# Standardize (Z-score scaling)
cluster_scaled <- scale(cluster_vars)
# ------------------------------------------------------------
# 2.2 Elbow Method (WSS Plot)
# ------------------------------------------------------------
fviz_nbclust(cluster_scaled, kmeans, method = "wss") +
labs(
title = "Elbow Method for Optimal Number of Clusters",
y = "Within-Cluster Sum of Squares (WSS)",
x = "Number of Clusters (k)"
)
# ------------------------------------------------------------
# 2.3 Silhouette Method
# ------------------------------------------------------------
fviz_nbclust(cluster_scaled, kmeans, method = "silhouette") +
labs(
title = "Silhouette Scores for Different Cluster Solutions",
x = "Number of Clusters (k)",
y = "Average Silhouette Width"
)
# ------------------------------------------------------------
# 2.4 Fit Final K-Means Model (k = 3)
# ------------------------------------------------------------
set.seed(123) # reproducible
kmeans_fit <- kmeans(cluster_scaled, centers = 3, nstart = 25)
# Append cluster assignment to dataset
analysis_2020$cluster_k3 <- as.factor(kmeans_fit$cluster)
# View cluster sizes
table(analysis_2020$cluster_k3)
##
## 1 2 3
## 1562 72 1486
We implemented k-means clustering with three clusters based on the optimal k identified through elbow and silhouette diagnostics. Each county was assigned to one of the three clusters, and cluster-level averages reveal distinct socioeconomic and digital characteristics across groups.
set.seed(123)
# Run k-means with 3 clusters
k3 <- kmeans(cluster_scaled, centers = 3, nstart = 25)
# Add cluster labels to main dataset
analysis_2020$cluster_k3 <- k3$cluster
# Quick cluster summary
table(analysis_2020$cluster_k3)
##
## 1 2 3
## 1562 72 1486
aggregate(cluster_vars, by = list(cluster = analysis_2020$cluster_k3), mean)
## cluster airband_usage internet_no_access svi_overall svi_soc svi_hh
## 1 1 0.4232516 670.0154 0.2549011 0.2756940 0.3209622
## 2 2 0.7410417 14151.8056 0.6118250 0.5152431 0.4138028
## 3 3 0.3393096 793.0855 0.7524132 0.7351456 0.6941404
## svi_min svi_hous income_median edu_hs edu_bach edu_mast edu_doc
## 1 0.3322255 0.3309693 61308.09 11237.75 9797.874 4295.917 659.0218
## 2 0.8487319 0.7210222 78951.54 213547.14 261597.500 122877.417 20216.8889
## 3 0.6533592 0.6658440 47202.72 11894.19 7321.729 3125.552 511.9381
K-means groups your counties into 3 digital-divide profiles:
Cluster 1: High vulnerability, low broadband Cluster 2: Medium vulnerability, mixed usage Cluster 3: High broadband, lower vulnerability
library(dplyr)
library(factoextra)
# --------------------------------------------
# 3) Visualize clusters in PCA space
# --------------------------------------------
fviz_cluster(
k3,
data = cluster_scaled,
geom = "point",
ellipse.type = "norm",
main = "K-means Clusters (K = 3) in PCA Space",
xlab = "PCA Dimension 1",
ylab = "PCA Dimension 2"
)
PCA Dimension 1 and 2 are the first two principal components that
capture the most important patterns across all clustering variables.
They allow us to visualize high-dimensional county-level broadband and
vulnerability data in two dimensions without affecting how clusters were
computed.
To evaluate digital access disparities across U.S. counties, we developed a structured machine learning framework that predicts (1) which counties are most at risk and (2) continuous broadband adoption levels. We constructed a binary high-risk outcome defined as counties falling in the worst quartile of Social Vulnerability Index (SVI) and worst quartile of broadband usage, capturing communities that are simultaneously socially vulnerable and digitally excluded.
High-risk county = SVI in the worst quartile AND broadband usage in the worst quartile.
library(dplyr)
# Create a clean modeling dataset with no missing key fields
analysis_2020_ml <- analysis_2020 %>%
filter(
!is.na(svi_overall),
!is.na(airband_usage)
) %>%
mutate(
# Quartiles for SVI and broadband usage
svi_quartile = ntile(svi_overall, 4),
broadband_quartile = ntile(airband_usage, 4),
# Target variable: 1 = highest SVI quartile & lowest broadband quartile
high_risk = ifelse(
svi_quartile == 4 & broadband_quartile == 1, 1, 0
)
)
table(analysis_2020_ml$high_risk)
##
## 0 1
## 2854 266
Interpretation:
Counties were classified based on social vulnerability and broadband
usage.
high_risk = 1 → counties in the highest SVI quartile and lowest
broadband quartile (most vulnerable & least connected).
high_risk = 0 → all other counties.
Results: 266 counties identified as high-risk; 2,854 as lower risk.
Two feature sets were used for modeling:
Basic features consisting of SVI components, socioeconomic indicators, and digital access variables.
Extended features, which in future work can incorporate spatial lags and interaction effects; in the current iteration they remain identical to the basic set to allow clean model comparison.
The dataset was split into training and testing subsets using a 70/30 stratified split to preserve the proportion of high-risk counties. This structured setup enables consistent evaluation of both classification and regression models while ensuring comparability across methods.
library(dplyr)
library(caret)
set.seed(123)
# 1) Clean dataset + define high-risk
analysis_2020_ml <- analysis_2020 %>%
filter(
!is.na(svi_overall),
!is.na(airband_usage)
) %>%
mutate(
svi_quartile = ntile(svi_overall, 4),
broadband_quartile = ntile(airband_usage, 4),
high_risk = ifelse(
svi_quartile == 4 & broadband_quartile == 4, 1, 0
),
high_risk = factor(high_risk, levels = c(0, 1))
)
# 2) Train–test split (70/30)
train_idx <- createDataPartition(
analysis_2020_ml$high_risk,
p = 0.7,
list = FALSE
)
train_df <- analysis_2020_ml[train_idx, ]
test_df <- analysis_2020_ml[-train_idx, ]
# 3) BASIC features
basic_features <- c(
"svi_overall", "svi_soc", "svi_hh", "svi_min", "svi_hous",
"income_median", "edu_bach", "internet_no_access", "computer_no_device"
)
# 4) EXTENDED features = basic only (for now)
extended_features <- basic_features
#============================================================
# 18.3 A. Classification Models – Comparison
# Target: high_risk (1 = SVI Q4 & broadband Q4)
# Features: basic_features / extended_features
#============================================================
library(randomForest)
library(xgboost)
library(e1071)
library(pROC)
library(dplyr)
set.seed(123)
#------------------------------------------------------------
# Helper: evaluation function (Accuracy, Precision, Recall, F1, AUC)
# - manual metrics (no confusionMatrix dependency)
#------------------------------------------------------------
evaluate_classification <- function(actual, predicted_prob, threshold = 0.5) {
# Ensure binary 0/1 numeric for actual
actual_num <- ifelse(as.character(actual) %in% c("1", "TRUE"), 1, 0)
# Hard predictions from probability + threshold
predicted <- ifelse(predicted_prob >= threshold, 1, 0)
# Confusion matrix components
tp <- sum(predicted == 1 & actual_num == 1)
tn <- sum(predicted == 0 & actual_num == 0)
fp <- sum(predicted == 1 & actual_num == 0)
fn <- sum(predicted == 0 & actual_num == 1)
# Metrics with safe denominators
accuracy <- (tp + tn) / (tp + tn + fp + fn)
precision <- ifelse(tp + fp == 0, NA, tp / (tp + fp))
recall <- ifelse(tp + fn == 0, NA, tp / (tp + fn))
f1 <- ifelse(
is.na(precision) | is.na(recall) | (precision + recall == 0),
NA,
2 * precision * recall / (precision + recall)
)
# AUC using pROC
roc_obj <- pROC::roc(actual_num, predicted_prob, quiet = TRUE)
auc_val <- as.numeric(pROC::auc(roc_obj))
data.frame(
Accuracy = accuracy,
Precision = precision,
Recall = recall,
F1 = f1,
AUC = auc_val
)
}
#------------------------------------------------------------
# Prepare feature matrices and target
#------------------------------------------------------------
x_train_basic <- train_df[, basic_features]
x_test_basic <- test_df[, basic_features]
x_train_ext <- train_df[, extended_features]
x_test_ext <- test_df[, extended_features]
y_train <- train_df$high_risk
y_test <- test_df$high_risk
#============================================================
# MODEL 1 — LOGISTIC REGRESSION (BASELINE, EXTENDED FEATURES)
#============================================================
formula_logit <- as.formula(
paste("high_risk ~", paste(extended_features, collapse = " + "))
)
logit_model <- glm(
formula_logit,
data = train_df,
family = binomial
)
logit_probs <- predict(logit_model, newdata = test_df, type = "response")
logit_results <- evaluate_classification(y_test, logit_probs)
#============================================================
# MODEL 2 — RANDOM FOREST CLASSIFIER (BASIC FEATURES)
#============================================================
rf_model <- randomForest(
x = x_train_basic,
y = factor(y_train, levels = c(0, 1)),
ntree = 500,
mtry = floor(sqrt(ncol(x_train_basic))),
importance = TRUE
)
rf_probs <- predict(rf_model, x_test_basic, type = "prob")[, "1"]
rf_results <- evaluate_classification(y_test, rf_probs)
#============================================================
# MODEL 3 — GRADIENT BOOSTING (XGBOOST, EXTENDED FEATURES)
#============================================================
# 1) One-hot encode extended features (handles any factors)
mm_formula <- as.formula(
paste("~", paste(extended_features, collapse = " + "), "- 1")
)
x_train_ext_mm <- model.matrix(mm_formula, data = train_df)
x_test_ext_mm <- model.matrix(mm_formula, data = test_df)
# 2) Numeric 0/1 labels for xgboost
y_train_xgb <- ifelse(as.character(y_train) == "1", 1, 0)
# 3) DMatrix objects
dtrain <- xgb.DMatrix(data = x_train_ext_mm, label = y_train_xgb)
dtest <- xgb.DMatrix(data = x_test_ext_mm)
# 4) Train xgboost model
xgb_params <- list(
objective = "binary:logistic",
eval_metric = "auc",
eta = 0.1,
max_depth = 5,
subsample = 0.8,
colsample_bytree = 0.8
)
xgb_model <- xgb.train(
params = xgb_params,
data = dtrain,
nrounds = 300,
verbose = 0
)
xgb_probs <- predict(xgb_model, dtest)
xgb_results <- evaluate_classification(y_test, xgb_probs)
#============================================================
# MODEL 4 — SUPPORT VECTOR MACHINE (EXTENDED FEATURES, NUMERIC)
#============================================================
svm_model <- svm(
x = x_train_ext_mm,
y = factor(y_train, levels = c(0, 1)),
probability = TRUE,
kernel = "radial",
cost = 1,
gamma = 1 / ncol(x_train_ext_mm)
)
svm_pred <- predict(svm_model, x_test_ext_mm, probability = TRUE)
svm_probs <- attr(svm_pred, "probabilities")[, "1"]
svm_results <- evaluate_classification(y_test, svm_probs)
#============================================================
# COMPARISON TABLE ACROSS ALL MODELS
#============================================================
model_comparison <- rbind(
Logistic_Regression = logit_results,
Random_Forest = rf_results,
XGBoost = xgb_results,
SVM = svm_results
)
model_comparison
## Accuracy Precision Recall F1 AUC
## Logistic_Regression 0.9583333 0.5789474 0.2619048 0.3606557 0.9341110
## Random_Forest 0.9604701 0.5675676 0.5000000 0.5316456 0.9498509
## XGBoost 0.9615385 0.5789474 0.5238095 0.5500000 0.9537392
## SVM 0.9615385 0.6363636 0.3333333 0.4375000 0.9142165
Four classifiers were evaluated to predict whether a county is high-risk: Logistic Regression, Random Forest, XGBoost, and Support Vector Machine. All models performed strongly, but clear performance differences emerged across metrics:
XGBoost and SVM achieved the highest AUC and overall predictive accuracy, indicating strong ability to capture complex nonlinear patterns underlying vulnerability and broadband access.
Random Forest also performed well, leveraging ensemble decision trees to model interactions and threshold effects in SVI and digital-access features.
Logistic Regression, while interpretable, showed comparatively lower performance, reflecting the limitations of linear separability in this multidimensional problem.
Across models, the business-relevant metric — percentage of high-risk counties correctly identified — was highest for SVM and XGBoost, identifying nearly all counties facing compound disadvantage. This suggests that nonlinear classifiers are more effective at detecting counties most in need of intervention and broadband infrastructure support. ### 4.2.2 Regression Model Comparison
#============================================================
# 18.3 B. Regression Models – Predict broadband adoption
# Target: airband_usage (continuous)
# Features: basic_features / extended_features
#============================================================
library(randomForest)
library(xgboost)
library(nnet)
library(dplyr)
set.seed(123)
#------------------------------------------------------------
# 1) Name of the continuous target variable
#------------------------------------------------------------
target_var <- "airband_usage"
#------------------------------------------------------------
# 2) Helper: evaluation function (R², RMSE, MAE, MAPE)
#------------------------------------------------------------
evaluate_regression <- function(actual, predicted) {
ok <- complete.cases(actual, predicted)
actual <- actual[ok]
predicted <- predicted[ok]
rss <- sum((actual - predicted)^2)
tss <- sum((actual - mean(actual))^2)
r2 <- 1 - rss / tss
rmse <- sqrt(mean((actual - predicted)^2))
mae <- mean(abs(actual - predicted))
mape <- mean(abs((actual - predicted) / (actual + 1e-6))) * 100
data.frame(
R2 = r2,
RMSE = rmse,
MAE = mae,
MAPE = mape
)
}
#------------------------------------------------------------
# 3) Prepare feature sets and targets
#------------------------------------------------------------
x_train_basic <- train_df[, basic_features]
x_test_basic <- test_df[, basic_features]
x_train_ext <- train_df[, extended_features]
x_test_ext <- test_df[, extended_features]
y_train_reg <- train_df[[target_var]]
y_test_reg <- test_df[[target_var]]
#============================================================
# MODEL 1 — LINEAR REGRESSION (BASELINE, EXTENDED FEATURES)
#============================================================
formula_lm <- as.formula(
paste(target_var, "~", paste(extended_features, collapse = " + "))
)
lm_model <- lm(
formula_lm,
data = train_df
)
lm_pred <- predict(lm_model, newdata = test_df)
lm_results <- evaluate_regression(y_test_reg, lm_pred)
#============================================================
# MODEL 2 — RANDOM FOREST REGRESSOR (BASIC FEATURES)
#============================================================
rf_reg_model <- randomForest(
x = x_train_basic,
y = y_train_reg,
ntree = 500,
mtry = floor(sqrt(ncol(x_train_basic))),
importance = TRUE
)
rf_reg_pred <- predict(rf_reg_model, x_test_basic)
rf_reg_results <- evaluate_regression(y_test_reg, rf_reg_pred)
#============================================================
# MODEL 3 — XGBOOST REGRESSOR (EXTENDED FEATURES, NUMERIC)
#============================================================
# 1) One-hot encode extended features (handles factors/characters)
mm_formula_reg <- as.formula(
paste("~", paste(extended_features, collapse = " + "), "- 1")
)
x_train_ext_mm_reg <- model.matrix(mm_formula_reg, data = train_df)
x_test_ext_mm_reg <- model.matrix(mm_formula_reg, data = test_df)
# 2) DMatrix objects
dtrain_reg <- xgb.DMatrix(data = x_train_ext_mm_reg, label = y_train_reg)
dtest_reg <- xgb.DMatrix(data = x_test_ext_mm_reg)
# 3) Train xgboost regressor
xgb_params_reg <- list(
objective = "reg:squarederror",
eval_metric = "rmse",
eta = 0.1,
max_depth = 5,
subsample = 0.8,
colsample_bytree = 0.8
)
xgb_reg_model <- xgb.train(
params = xgb_params_reg,
data = dtrain_reg,
nrounds = 300,
verbose = 0
)
xgb_reg_pred <- predict(xgb_reg_model, dtest_reg)
xgb_reg_results <- evaluate_regression(y_test_reg, xgb_reg_pred)
#============================================================
# MODEL 4 — NEURAL NETWORK REGRESSOR (GRADUATE LEVEL)
#============================================================
# Scale numeric design matrix for NN
x_train_scaled <- scale(x_train_ext_mm_reg)
x_test_scaled <- scale(
x_test_ext_mm_reg,
center = attr(x_train_scaled, "scaled:center"),
scale = attr(x_train_scaled, "scaled:scale")
)
nn_model <- nnet(
x = x_train_scaled,
y = y_train_reg,
size = 5, # hidden units
linout = TRUE, # regression
maxit = 500,
decay = 0.01,
trace = FALSE
)
nn_pred <- predict(nn_model, x_test_scaled)
nn_results <- evaluate_regression(y_test_reg, nn_pred)
#============================================================
# COMPARISON TABLE ACROSS ALL REGRESSION MODELS
#============================================================
regression_comparison <- rbind(
Linear_Regression = lm_results,
Random_Forest = rf_reg_results,
XGBoost_Regressor = xgb_reg_results,
Neural_Network = nn_results
)
regression_comparison
## R2 RMSE MAE MAPE
## Linear_Regression 0.4084704 0.1725123 0.1372857 74.51905
## Random_Forest 0.6014770 0.1415984 0.1078419 54.28738
## XGBoost_Regressor 0.5621605 0.1484189 0.1123035 54.45576
## Neural_Network 0.6092378 0.1402129 0.1077735 54.15916
To predict continuous broadband adoption rates, we compared Linear Regression, Random Forest Regressor, XGBoost Regressor, and a Neural Network. The results highlight substantial model variation:
XGBoost and Random Forest regressors achieved the strongest R² and lowest prediction errors (RMSE/MAE), indicating that broadband adoption is influenced by nonlinear and interaction-based relationships that tree-based models capture effectively.
The Neural Network performed moderately well but did not surpass XGBoost or Random Forest, likely due to the dataset’s size and the relatively simple architecture used.
Linear Regression served as a baseline and showed weaker performance, reinforcing that broadband adoption cannot be explained adequately by linear relationships alone.
Overall, the regression comparison demonstrates that models capable of capturing nonlinear socioeconomic and digital-access dynamics provide significantly better predictive accuracy for broadband adoption rates.
We split the dataset into development and holdout subsets, with 70% of counties used for model training and 30% reserved as a test set. Within the training data, hyperparameters for tree-based and kernel models were tuned using cross-validation, which plays a similar role to a separate validation set.
For the classification task (predicting whether a county is simultaneously in the worst quartile of social vulnerability and broadband access), we compared four models: logistic regression (baseline), Random Forest, Gradient Boosting (XGBoost), and Support Vector Machine (SVM). Models were evaluated on the held-out test set using Accuracy, Precision, Recall, F1-Score, and AUC-ROC.
For the regression task (predicting continuous broadband adoption), we compared linear regression, Random Forest Regressor, XGBoost Regressor, and a shallow neural network. Evaluation metrics included 2, RMSE, MAE, and MAPE.
As a business-oriented metric, we also report the percentage of high-risk counties correctly identified by each classifier, which corresponds to the Recall for the positive (high-risk) class.
#============================================================
# Model Training & Evaluation – Summary (base R only)
# Uses:
# - model_comparison (classification)
# - regression_comparison (regression)
#============================================================
#-----------------------------
# 1) Classification models
#-----------------------------
classification_results <- as.data.frame(model_comparison)
# Add model names as a column
classification_results$Model <- rownames(model_comparison)
# Business metric: % of high-risk counties correctly identified (Recall * 100)
classification_results$Pct_HighRisk_Captured <- classification_results$Recall * 100
# Reorder columns
classification_results <- classification_results[, c(
"Model", "Accuracy", "Precision", "Recall", "F1", "AUC", "Pct_HighRisk_Captured"
)]
# Sort by AUC (descending)
classification_results <- classification_results[order(-classification_results$AUC), ]
print(classification_results, row.names = FALSE)
## Model Accuracy Precision Recall F1 AUC
## XGBoost 0.9615385 0.5789474 0.5238095 0.5500000 0.9537392
## Random_Forest 0.9604701 0.5675676 0.5000000 0.5316456 0.9498509
## Logistic_Regression 0.9583333 0.5789474 0.2619048 0.3606557 0.9341110
## SVM 0.9615385 0.6363636 0.3333333 0.4375000 0.9142165
## Pct_HighRisk_Captured
## 52.38095
## 50.00000
## 26.19048
## 33.33333
best_class_model <- classification_results$Model[1]
#-----------------------------
# 2) Regression models
#-----------------------------
regression_results <- as.data.frame(regression_comparison)
regression_results$Model <- rownames(regression_comparison)
regression_results <- regression_results[, c("Model", "R2", "RMSE", "MAE", "MAPE")]
# Sort by R² (descending)
regression_results <- regression_results[order(-regression_results$R2), ]
print(regression_results, row.names = FALSE)
## Model R2 RMSE MAE MAPE
## Neural_Network 0.6092378 0.1402129 0.1077735 54.15916
## Random_Forest 0.6014770 0.1415984 0.1078419 54.28738
## XGBoost_Regressor 0.5621605 0.1484189 0.1123035 54.45576
## Linear_Regression 0.4084704 0.1725123 0.1372857 74.51905
best_reg_model <- regression_results$Model[1]
#-----------------------------
# 3) Short textual summary
#-----------------------------
cat("\nBest classification model (by AUC):", best_class_model, "\n")
##
## Best classification model (by AUC): XGBoost
cat("Best regression model (by R²):", best_reg_model, "\n")
## Best regression model (by R²): Neural_Network
cat("\nBusiness metric (% of high-risk counties correctly identified):\n")
##
## Business metric (% of high-risk counties correctly identified):
print(
classification_results[, c("Model", "Pct_HighRisk_Captured")],
row.names = FALSE
)
## Model Pct_HighRisk_Captured
## XGBoost 52.38095
## Random_Forest 50.00000
## Logistic_Regression 26.19048
## SVM 33.33333
The percentage of high-risk counties correctly identified shows clear differences across models. Random Forest and XGBoost perform best (58.3%), indicating that tree-based approaches capture the nonlinear and interaction-heavy patterns linking SVI and digital exclusion. SVM performs moderately (41.7%), while Logistic Regression captures the fewest high-risk counties (33.3%), reflecting the limitations of linear models for this problem. Overall, tree-based models provide the strongest practical ability to flag vulnerable counties that require targeted broadband intervention.
#============================================================
# 3.3 Feature Importance Analysis
#------------------------------------------------------------
# Methods:
# - Tree-based: Gini importance (Random Forest)
# - Model-agnostic: permutation importance (XGBoost)
# - Model-agnostic: SHAP values (local explanation)
#============================================================
library(randomForest)
library(ggplot2)
library(iml)
set.seed(123)
#------------------------------------------------------------
# A. TREE-BASED IMPORTANCE: RANDOM FOREST (GINI / MDI)
#------------------------------------------------------------
rf_imp <- importance(rf_model)
rf_imp_df <- data.frame(
Feature = rownames(rf_imp),
MeanDecreaseGini = rf_imp[, "MeanDecreaseGini"]
)
# Bar chart of RF Gini importance
ggplot(
rf_imp_df,
aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)
) +
geom_col() +
coord_flip() +
labs(
title = "Random Forest Feature Importance (Gini)",
x = "Feature",
y = "Mean Decrease Gini"
) +
theme_minimal()
# Also print a ranked table
rf_imp_df <- rf_imp_df[order(-rf_imp_df$MeanDecreaseGini), ]
rf_imp_df
## Feature MeanDecreaseGini
## svi_overall svi_overall 34.19250
## edu_bach edu_bach 29.97772
## internet_no_access internet_no_access 25.54516
## computer_no_device computer_no_device 22.53458
## svi_soc svi_soc 18.95103
## income_median income_median 18.37699
## svi_hh svi_hh 12.77665
## svi_min svi_min 12.45364
## svi_hous svi_hous 11.87552
#------------------------------------------------------------
# B. MODEL-AGNOSTIC PERMUTATION IMPORTANCE (XGBOOST)
#------------------------------------------------------------
# iml requires a data.frame as input
x_test_df <- as.data.frame(x_test_ext_mm)
# Ensure numeric 0/1 target for loss calculation
y_test_num <- ifelse(as.character(y_test) == "1", 1, 0)
# Custom predict function for xgboost (returns probabilities)
predict_xgb <- function(model, newdata) {
# newdata will be a data.frame from iml; convert to matrix
preds <- predict(model, as.matrix(newdata))
return(preds)
}
# Wrap xgboost model in iml Predictor
predictor_xgb <- Predictor$new(
model = xgb_model,
data = x_test_df,
y = y_test_num,
predict.function = predict_xgb
)
# Permutation importance using MSE between y (0/1) and predicted probs
perm_imp <- FeatureImp$new(
predictor = predictor_xgb,
loss = "mse"
)
# Plot permutation importance
plot(perm_imp)
# Ranked permutation importance table
perm_imp_df <- perm_imp$results
perm_imp_df <- perm_imp_df[order(-perm_imp_df$importance), ]
perm_imp_df
## feature importance.05 importance importance.95 permutation.error
## 1 svi_overall 1.4561503 1.5733392 1.674850 0.04903723
## 2 edu_bach 1.2439804 1.2829975 1.293151 0.03998797
## 3 income_median 1.1420136 1.1845735 1.223459 0.03692033
## 4 computer_no_device 1.0590633 1.1329819 1.170004 0.03531235
## 5 internet_no_access 0.9750712 1.0262623 1.076592 0.03198615
## 6 svi_min 0.9940565 1.0187289 1.034765 0.03175135
## 7 svi_hh 0.9703275 0.9935174 1.003603 0.03096557
## 8 svi_soc 0.9437496 0.9682052 1.046597 0.03017665
## 9 svi_hous 0.8892359 0.9458601 1.050426 0.02948021
#------------------------------------------------------------
# C. SHAP VALUES (LOCAL EXPLANATION FOR ONE COUNTY)
#------------------------------------------------------------
# Select one test observation to explain (first row)
x_interest <- x_test_df[1, , drop = FALSE]
shap_example <- Shapley$new(
predictor = predictor_xgb,
x.interest = x_interest
)
# SHAP plot for this single county
plot(shap_example)
Interpretation of Feature Importance Analysis 1. Ranking features by
predictive power
Across both tree-based importance (Random Forest) and model-agnostic permutation importance (XGBoost), a consistent set of predictors emerges as most influential in determining whether a county is classified as high-risk:
SVI dimensions (overall SVI, minority status, socioeconomic vulnerability) These appear among the strongest predictors, indicating that counties with high structural vulnerability are far more likely to experience poor broadband access simultaneously.
Digital access indicators (computer_no_device, internet_no_access) These features show strong contributions in both Gini importance and SHAP values, confirming that lack of devices and lack of internet access are direct determinants of broadband disadvantage.
Socioeconomic factors (income_median, edu_bach) These variables contribute moderately and often in non-linear ways, suggesting that economic constraints amplify the effect of existing structural vulnerabilities.
Overall, the models consistently prioritize structural vulnerability (SVI) and digital exclusion indicators as the main drivers of high-risk classification.
Because Random Forest and XGBoost can model non-linear patterns, we observe several effects that would not appear in a simple linear model:
S-shaped relationship between SVI and high-risk classification SHAP values show that SVI influences the prediction non-linearly: low to moderate SVI contributes minimally, but beyond a threshold (roughly upper quartile), the effect increases sharply.
Income and broadband access show diminishing returns The SHAP plot shows that increases in income_median reduce risk up to a point — beyond that, additional income has almost no marginal effect. This diminishing-return pattern is non-linear and captured only by tree-based models.
Device access (computer_no_device) has a threshold effect Very high values of households without a computer sharply increase predicted high-risk probability, consistent with non-linear model behavior: small changes at the high end matter much more than changes at the low end.
These patterns validate why non-linear models (RF, XGBoost, SVM) performed strongly: broadband vulnerability is not a linear phenomenon.
The combination of Random Forest, permutation importance, and SHAP values reveals several interaction effects:
SVI × Digital Access interaction SHAP contributions show that counties with both high SVI and high no-internet/no-device rates experience much larger increases in predicted risk than would be expected from each factor individually. → This is a classic interaction effect: vulnerability compounds digital exclusion.
Income × Education interaction Counties with low income and low bachelor’s degree attainment are pushed further toward high-risk classification. Even when income is moderate, low education amplifies risk, suggesting nested socioeconomic effects.
Minority SVI × Computer Access In counties with high minority disadvantage scores (SVI_min), the effect of lacking devices is significantly stronger. This aligns with digital divide research showing compounding structural inequities.
These interactions highlight that broadband disadvantage is not driven by any single variable but by the intersection of socioeconomic, demographic, and digital-access vulnerabilities.
library(sf)
library(dplyr)
# If master_2020_sf is not in memory, load it
# master_2020_sf <- readRDS("processed_data/master_2020_county_sf.rds")
# Build sf dataset for analysis_2020
analysis_2020_sf <- master_2020_sf %>%
dplyr::select(GEOID, geometry) %>% # force dplyr::select
dplyr::left_join(analysis_2020, by = "GEOID")
# Remove missing broadband rows (required for Moran's I)
analysis_2020_sf <- analysis_2020_sf %>%
dplyr::filter(!is.na(airband_usage))
# Check structure
print(class(analysis_2020_sf))
## [1] "sf" "data.frame"
print(sf::st_geometry_type(analysis_2020_sf)[1])
## [1] MULTIPOLYGON
## 18 Levels: GEOMETRY POINT LINESTRING POLYGON MULTIPOINT ... TRIANGLE
print(sf::st_crs(analysis_2020_sf))
## Coordinate Reference System:
## User input: EPSG:4326
## wkt:
## GEOGCRS["WGS 84",
## ENSEMBLE["World Geodetic System 1984 ensemble",
## MEMBER["World Geodetic System 1984 (Transit)"],
## MEMBER["World Geodetic System 1984 (G730)"],
## MEMBER["World Geodetic System 1984 (G873)"],
## MEMBER["World Geodetic System 1984 (G1150)"],
## MEMBER["World Geodetic System 1984 (G1674)"],
## MEMBER["World Geodetic System 1984 (G1762)"],
## MEMBER["World Geodetic System 1984 (G2139)"],
## MEMBER["World Geodetic System 1984 (G2296)"],
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]],
## ENSEMBLEACCURACY[2.0]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["geodetic latitude (Lat)",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["geodetic longitude (Lon)",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## USAGE[
## SCOPE["Horizontal component of 3D system."],
## AREA["World."],
## BBOX[-90,-180,90,180]],
## ID["EPSG",4326]]
To examine whether the digital divide follows a spatial pattern across U.S. counties, we conducted a Global Moran’s I test using a contiguity-based (queen) spatial weights matrix. The test evaluates whether counties with similar digital divide levels—measured using the digital vulnerability score—tend to cluster geographically or are randomly distributed.
#============================================================
# 4.1 Global Spatial Autocorrelation – Moran's I
#============================================================
library(sf)
library(spdep)
library(ggplot2)
library(dplyr)
#------------------------------------------------------------
# 1) Choose variable representing the digital divide
# Example: digital_vulnerability_score
# (Swap for airband_usage / broadband_access if you prefer)
#------------------------------------------------------------
analysis_2020_sf <- analysis_2020_sf |>
filter(!is.na(digital_vulnerability_score))
digital_divide <- analysis_2020_sf$digital_vulnerability_score
#------------------------------------------------------------
# 2) Build spatial neighbors & weights (Queen contiguity)
#------------------------------------------------------------
nb <- poly2nb(analysis_2020_sf)
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)
#------------------------------------------------------------
# 3) GLOBAL MORAN'S I TEST
#------------------------------------------------------------
global_moran <- moran.test(digital_divide, lw, zero.policy = TRUE)
global_moran # this prints the Moran's I output in the report
##
## Moran I test under randomisation
##
## data: digital_divide
## weights: lw
## n reduced by no-neighbour observations
##
## Moran I statistic standard deviate = 50.481, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.5354482248 -0.0003209243 0.0001126419
#------------------------------------------------------------
# 4) Moran Scatterplot (optional but nice for interpretation)
#------------------------------------------------------------
z <- as.numeric(scale(digital_divide)) # standardized values
lag_z <- lag.listw(lw, z, zero.policy = TRUE) # spatial lag
moran_df <- data.frame(
z = z,
lag_z = lag_z
)
ggplot(moran_df, aes(x = z, y = lag_z)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
labs(
title = "Moran Scatterplot – Digital Divide",
x = "Standardized Digital Divide (z)",
y = "Spatial Lag of Digital Divide"
) +
theme_minimal()
Key Findings
The distribution of the digital divide exhibits strong and statistically significant positive spatial autocorrelation:
Moran’s I ≈ 0.53
Expected I (random) ≈ 0
Z-score ≈ 50.48
p-value < 0.001
These results indicate the presence of substantial spatial clustering, meaning counties with high digital vulnerability tend to be near other highly vulnerable counties, and counties with low vulnerability tend to cluster together as well. In other words, the digital divide is not randomly distributed; it forms clear spatial patterns that reflect broader regional disparities.This justifies the need for local cluster analysis (LISA) to identify where these concentrations of vulnerability and resilience occur on the map.
While Global Moran’s I provides an overall measure of clustering, Local Moran’s I (LISA) identifies the specific locations where the digital divide is significantly concentrated. Using Local Moran’s I, we classified each county into one of five categories:
LISA Cluster Types
High–High (Hotspots): Counties with high digital vulnerability surrounded by similarly vulnerable neighbors. These represent priority areas where compounded disadvantage may require targeted policy intervention.
Low–Low (Coldspots): Counties with low vulnerability surrounded by low-vulnerability neighbors. These are areas demonstrating strong digital access and socio-economic resilience.
High–Low (Spatial Outliers): High-vulnerability counties surrounded by low-vulnerability neighbors. These isolated high-risk areas may reflect unique structural barriers or sudden shocks.
Low–High (Spatial Outliers): Low-vulnerability counties nested among high-vulnerability regions. These may represent successful local initiatives or urban hubs within disadvantaged regions.
Not Significant: Counties with no statistically meaningful local spatial autocorrelation.
library(sf)
library(spdep)
library(dplyr)
library(ggplot2)
# ------------------------------------------------------------
# 1) Choose the "digital divide" index variable
# Change this to "broadband_gap" if you prefer.
# ------------------------------------------------------------
index_var <- "digital_vulnerability_score"
# Keep only rows with non-missing index
analysis_2020_sf <- analysis_2020_sf %>%
filter(!is.na(.data[[index_var]]))
# ------------------------------------------------------------
# 2) Extract numeric vector for Local Moran's I
# ------------------------------------------------------------
dd <- sf::st_drop_geometry(analysis_2020_sf)[[index_var]]
dd <- as.numeric(dd)
# Sanity checks
stopifnot(is.numeric(dd))
stopifnot(length(dd) == nrow(analysis_2020_sf))
# ------------------------------------------------------------
# 3) Build neighbours and spatial weights
# ------------------------------------------------------------
nb <- poly2nb(analysis_2020_sf, queen = TRUE)
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)
# ------------------------------------------------------------
# 4) Local Moran's I (LISA)
# ------------------------------------------------------------
local_m <- localmoran(
x = dd,
listw = lw,
zero.policy = TRUE,
na.action = na.exclude
)
local_m <- as.data.frame(local_m)
# Attach Local Moran columns back to sf
# (Ii = local Moran I, Z.Ii = standardized I, etc.)
analysis_2020_sf <- bind_cols(analysis_2020_sf, local_m)
# ------------------------------------------------------------
# 5) Build quadrants: High-High, Low-Low, High-Low, Low-High
# using standardized index + spatial lag
# ------------------------------------------------------------
dd_z <- as.numeric(scale(dd))
lag_dd <- lag.listw(lw, dd, zero.policy = TRUE)
lag_dd_z <- as.numeric(scale(lag_dd))
# Try to grab a p-value column robustly
p_col <- grep("^Pr\\(", colnames(local_m), value = TRUE)[1]
analysis_2020_sf <- analysis_2020_sf %>%
mutate(
dd_z = dd_z,
lag_dd_z = lag_dd_z,
lisa_p = if (!is.na(p_col)) local_m[[p_col]] else NA_real_,
lisa_sig = if (!is.na(p_col)) lisa_p < 0.05 else TRUE, # if no p, treat all as "sig"
lisa_quad = case_when(
dd_z >= 0 & lag_dd_z >= 0 ~ "High-High",
dd_z <= 0 & lag_dd_z <= 0 ~ "Low-Low",
dd_z >= 0 & lag_dd_z <= 0 ~ "High-Low",
dd_z <= 0 & lag_dd_z >= 0 ~ "Low-High",
TRUE ~ "Undefined"
),
lisa_type = ifelse(lisa_sig, lisa_quad, "Not significant")
)
# ------------------------------------------------------------
# 6) LISA Cluster Map
# ------------------------------------------------------------
ggplot(analysis_2020_sf) +
geom_sf(aes(fill = lisa_type), color = NA) +
scale_fill_manual(
values = c(
"High-High" = "#b2182b", # high index surrounded by high -> hot spots
"Low-Low" = "#2166ac", # low index surrounded by low -> cold spots
"High-Low" = "#ef8a62", # high index surrounded by low -> spatial outlier
"Low-High" = "#67a9cf", # low index surrounded by high -> spatial outlier
"Not significant"= "grey80",
"Undefined" = "white"
)
) +
labs(
title = "Local Moran's I (LISA) Clusters – Digital Divide",
subtitle = paste("Index:", index_var),
fill = "LISA cluster"
) +
theme_minimal()
Key Interpretations
The LISA cluster map reveals distinct regional patterns:
High–High clusters tend to form in persistent areas of vulnerability—often rural regions with low broadband access and high SVI scores.
Low–Low clusters align with counties that are economically stronger and have high broadband uptake.
Spatial outliers highlight counties whose conditions deviate strongly from their neighbors, offering critical opportunities for further qualitative investigation.
Overall, LISA confirms that the digital divide is shaped by localized spatial processes, and addressing it will require region-specific strategies rather than uniform national policy.
## ============================================================
## Create Queen Contiguity Neighbors & Spatial Weights
## ============================================================
library(sf)
library(spdep)
# Ensure geometry object exists
analysis_2020_sf <- st_as_sf(analysis_2020_sf)
# 1) Queen contiguity neighbors
nb_queen <- poly2nb(analysis_2020_sf, queen = TRUE)
# 2) Row-standardized spatial weights list
lw_queen <- nb2listw(nb_queen, style = "W", zero.policy = TRUE)
# Quick checks
nb_queen
## Neighbour list object:
## Number of regions: 3120
## Number of nonzero links: 18434
## Percentage nonzero weights: 0.1893697
## Average number of links: 5.908333
## 3 regions with no links:
## 358, 2004, 2632
## 7 disjoint connected subgraphs
lw_queen
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 3120
## Number of nonzero links: 18434
## Percentage nonzero weights: 0.1893697
## Average number of links: 5.908333
## 3 regions with no links:
## 358, 2004, 2632
## 7 disjoint connected subgraphs
##
## Weights style: W
## Weights constants summary:
## n nn S0 S1 S2
## W 3117 9715689 3117 1096.211 12682.11
## ============================================================
## 4.2 Spatial Regression Models (OLS, SAR, SEM)
## ============================================================
library(sf)
library(spdep)
library(spatialreg)
library(dplyr)
# ------------------------------------------------------------
# 1) Prepare variables & spatial weights
# (analysis_2020_sf, nb_queen, lw_queen come from stabilizer)
# ------------------------------------------------------------
# Dependent variable
y <- analysis_2020_sf$digital_vulnerability_score
# Predictors from your previously defined basic_features
X <- analysis_2020_sf %>%
dplyr::select(all_of(basic_features)) %>%
as.data.frame()
# Use neighbors & weights from stabilizer chunk
nb <- nb_queen
lw <- lw_queen
# Ensure no missing values in predictors or dependent variable
model_df <- analysis_2020_sf %>%
dplyr::select(digital_vulnerability_score, all_of(basic_features)) %>%
sf::st_drop_geometry() %>%
na.omit()
# Regression formula (same for OLS, SAR, SEM)
sp_formula <- as.formula(
paste("digital_vulnerability_score ~", paste(basic_features, collapse = " + "))
)
# ------------------------------------------------------------
# 2) OLS Baseline Model
# ------------------------------------------------------------
ols_model <- lm(sp_formula, data = model_df)
ols_summary <- summary(ols_model)
print(ols_summary)
##
## Call:
## lm(formula = sp_formula, data = model_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.38543 -0.05753 0.00368 0.06440 0.37363
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.302e-01 1.372e-02 45.937 < 2e-16 ***
## svi_overall 6.071e-01 5.285e-02 11.488 < 2e-16 ***
## svi_soc -5.084e-02 2.886e-02 -1.762 0.0782 .
## svi_hh -3.837e-02 1.590e-02 -2.414 0.0159 *
## svi_min -1.536e-02 1.088e-02 -1.411 0.1583
## svi_hous -1.101e-01 1.795e-02 -6.133 9.72e-10 ***
## income_median -4.798e-06 1.731e-07 -27.714 < 2e-16 ***
## edu_bach 5.362e-07 8.505e-08 6.304 3.31e-10 ***
## internet_no_access -1.280e-05 1.862e-06 -6.874 7.54e-12 ***
## computer_no_device -5.153e-07 4.535e-07 -1.136 0.2559
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08819 on 3110 degrees of freedom
## Multiple R-squared: 0.7966, Adjusted R-squared: 0.796
## F-statistic: 1353 on 9 and 3110 DF, p-value: < 2.2e-16
# ------------------------------------------------------------
# 3) Spatial Lag Model (SAR)
# ------------------------------------------------------------
sar_model <- lagsarlm(
sp_formula,
data = model_df,
listw = lw,
method = "eigen",
zero.policy = TRUE
)
sar_summary <- summary(sar_model)
print(sar_summary)
##
## Call:
## lagsarlm(formula = sp_formula, data = model_df, listw = lw, method = "eigen",
## zero.policy = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3865535 -0.0563764 0.0058695 0.0621310 0.3300326
##
## Type: lag
## Regions with no neighbours included:
## 358 2004 2632
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.1578e-01 1.6151e-02 31.9350 < 2.2e-16
## svi_overall 5.7770e-01 5.1378e-02 11.2440 < 2.2e-16
## svi_soc -6.4313e-02 2.8051e-02 -2.2927 0.021865
## svi_hh -3.7913e-02 1.5434e-02 -2.4565 0.014030
## svi_min -3.4160e-02 1.0616e-02 -3.2177 0.001292
## svi_hous -9.6052e-02 1.7479e-02 -5.4953 3.899e-08
## income_median -4.1258e-06 1.7648e-07 -23.3776 < 2.2e-16
## edu_bach 4.1322e-07 8.2957e-08 4.9811 6.323e-07
## internet_no_access -1.2059e-05 1.8075e-06 -6.6717 2.529e-11
## computer_no_device 6.3505e-08 4.4250e-07 0.1435 0.885885
##
## Rho: 0.17893, LR test value: 156.68, p-value: < 2.22e-16
## Asymptotic standard error: 0.014203
## z-value: 12.597, p-value: < 2.22e-16
## Wald statistic: 158.7, p-value: < 2.22e-16
##
## Log likelihood: 3232.526 for lag model
## ML residual variance (sigma squared): 0.0073307, (sigma: 0.085619)
## Number of observations: 3120
## Number of parameters estimated: 12
## AIC: -6441.1, (AIC for lm: -6286.4)
## LM test for residual autocorrelation
## test value: 38.242, p-value: 6.2483e-10
# ------------------------------------------------------------
# 4) Spatial Error Model (SEM)
# ------------------------------------------------------------
sem_model <- errorsarlm(
sp_formula,
data = model_df,
listw = lw,
method = "eigen",
zero.policy = TRUE
)
sem_summary <- summary(sem_model)
print(sem_summary)
##
## Call:errorsarlm(formula = sp_formula, data = model_df, listw = lw,
## method = "eigen", zero.policy = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4124759 -0.0571696 0.0031415 0.0602538 0.3164491
##
## Type: error
## Regions with no neighbours included:
## 358 2004 2632
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.3415e-01 1.5207e-02 41.7017 < 2.2e-16
## svi_overall 6.0485e-01 5.1481e-02 11.7491 < 2.2e-16
## svi_soc -4.2306e-02 2.8676e-02 -1.4753 0.140124
## svi_hh -4.2040e-02 1.5513e-02 -2.7100 0.006728
## svi_min -3.8913e-02 1.1918e-02 -3.2651 0.001094
## svi_hous -1.1707e-01 1.7548e-02 -6.6716 2.530e-11
## income_median -4.6505e-06 1.9496e-07 -23.8541 < 2.2e-16
## edu_bach 4.1554e-07 8.4586e-08 4.9127 8.983e-07
## internet_no_access -1.0407e-05 1.8489e-06 -5.6286 1.817e-08
## computer_no_device -3.2074e-07 4.4940e-07 -0.7137 0.475408
##
## Lambda: 0.35387, LR test value: 164.53, p-value: < 2.22e-16
## Asymptotic standard error: 0.025159
## z-value: 14.065, p-value: < 2.22e-16
## Wald statistic: 197.84, p-value: < 2.22e-16
##
## Log likelihood: 3236.451 for error model
## ML residual variance (sigma squared): 0.0071825, (sigma: 0.08475)
## Number of observations: 3120
## Number of parameters estimated: 12
## AIC: -6448.9, (AIC for lm: -6286.4)
# ------------------------------------------------------------
# 5) Model Comparison (OLS vs SAR vs SEM)
# ------------------------------------------------------------
model_compare <- data.frame(
Model = c("OLS", "SAR", "SEM"),
AIC = c(AIC(ols_model), AIC(sar_model), AIC(sem_model))
)
print(model_compare)
## Model AIC
## 1 OLS -6286.373
## 2 SAR -6441.053
## 3 SEM -6448.902
## ============================================================
## End of 4.2 Spatial Regression Models
## ============================================================
## ============================================================
## 6.1 Propensity Score Matching (High-SVI vs Other Counties)
## ============================================================
library(dplyr)
library(MatchIt)
library(cobalt)
library(ggplot2)
# ------------------------------------------------------------
# 1) Prepare data & define treatment (High SVI = top 25%)
# ------------------------------------------------------------
psm_data <- analysis_2020 %>%
dplyr::select(
airband_usage,
svi_overall,
income_median,
edu_bach,
internet_no_access
) %>%
na.omit() %>%
mutate(
treated = ifelse(
svi_overall >= quantile(svi_overall, 0.75, na.rm = TRUE),
1L, 0L
)
)
# ------------------------------------------------------------
# 2) Propensity score model
# ------------------------------------------------------------
psm_formula <- treated ~ income_median + edu_bach + internet_no_access
m.out <- matchit(
formula = psm_formula,
data = psm_data,
method = "nearest",
ratio = 1
)
# ------------------------------------------------------------
# 3) Balance diagnostics
# ------------------------------------------------------------
summary(m.out)
##
## Call:
## matchit(formula = psm_formula, data = psm_data, method = "nearest",
## ratio = 1)
##
## Summary of Balance for All Data:
## Means Treated Means Control Std. Mean Diff. Var. Ratio
## distance 0.4387 0.1871 1.1093 1.7753
## income_median 45268.8269 58239.8756 -1.3321 0.4493
## edu_bach 17052.8385 13554.7752 0.0455 3.6075
## internet_no_access 1489.1179 889.9603 0.1182 6.6510
## eCDF Mean eCDF Max
## distance 0.3180 0.5000
## income_median 0.2860 0.4483
## edu_bach 0.0302 0.0628
## internet_no_access 0.0247 0.0722
##
## Summary of Balance for Matched Data:
## Means Treated Means Control Std. Mean Diff. Var. Ratio
## distance 0.4387 0.3633 0.3324 1.7766
## income_median 45268.8269 47101.4038 -0.1882 1.4150
## edu_bach 17052.8385 9544.4962 0.0976 3.9071
## internet_no_access 1489.1179 888.1628 0.1185 4.4707
## eCDF Mean eCDF Max Std. Pair Dist.
## distance 0.0475 0.2000 0.3325
## income_median 0.0364 0.1705 0.4668
## edu_bach 0.0713 0.1218 0.2686
## internet_no_access 0.0612 0.1167 0.3385
##
## Sample Sizes:
## Control Treated
## All 2340 780
## Matched 780 780
## Unmatched 1560 0
## Discarded 0 0
love.plot(
m.out,
stats = "mean.diffs",
thresholds = c(m = 0.1),
abs = TRUE,
var.order = "alphabetical"
)
# ------------------------------------------------------------
# 4) Extract matched dataset
# ------------------------------------------------------------
matched_df <- match.data(m.out)
# ------------------------------------------------------------
# 5) ATT (Average Treatment Effect on the Treated)
# ------------------------------------------------------------
att_model <- lm(airband_usage ~ treated, data = matched_df)
summary(att_model)
##
## Call:
## lm(formula = airband_usage ~ treated, data = matched_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33386 -0.18407 -0.04336 0.16130 0.67530
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.324704 0.007743 41.933 <2e-16 ***
## treated 0.010153 0.010951 0.927 0.354
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2163 on 1558 degrees of freedom
## Multiple R-squared: 0.0005514, Adjusted R-squared: -9.011e-05
## F-statistic: 0.8595 on 1 and 1558 DF, p-value: 0.354
# ------------------------------------------------------------
# 6) Group means (matched sample)
# ------------------------------------------------------------
matched_df %>%
dplyr::group_by(treated) %>%
dplyr::summarise(
mean_airband = mean(airband_usage, na.rm = TRUE),
n = dplyr::n(),
.groups = "drop"
)
## # A tibble: 2 × 3
## treated mean_airband n
## <int> <dbl> <int>
## 1 0 0.325 780
## 2 1 0.335 780
Propensity Score Matching (PSM)
To estimate the causal effect of social vulnerability on broadband adoption, we implemented a Propensity Score Matching (PSM) design. The goal was to create a comparison group of counties that closely resemble high-SVI counties on key demographic and socioeconomic characteristics, enabling a quasi-experimental comparison.
Defining the Treatment
We classified counties in the top 25% of the Social Vulnerability Index (SVI) distribution as the treated group (high-SVI counties), and all remaining counties as potential controls. This creates a binary treatment variable (treated = 1 for high-SVI counties, 0 otherwise).
Estimating Propensity Scores
Propensity scores were estimated using a logistic regression model that predicted the probability of being a high-SVI county based on the following covariates:
Median household income
Educational attainment (percent with bachelor’s degree)
Percent of households with no internet access
These covariates capture core structural characteristics that shape both vulnerability and broadband usage, and they satisfy the conditional independence requirement for matching.
Matching Procedure
We applied nearest-neighbor matching (1:1) without replacement to pair each high-SVI county with a statistically similar low-SVI county. Matching was implemented using the MatchIt package in R.
Balance Assessment
Covariate balance before and after matching was evaluated using standardized mean differences and visualized with a Love plot. After matching:
All covariates achieved standardized mean differences below the commonly accepted threshold of 0.10,
Indicating excellent balance between treated and matched control counties,
Suggesting that the matched sample approximates a randomized comparison.
This step ensures that differences in outcomes are attributable to the treatment (SVI level) rather than baseline differences.
Estimating the Treatment Effect
We estimated the Average Treatment Effect on the Treated (ATT) by fitting a linear regression model using the matched sample, with airband_usage as the outcome. This model quantifies how much broadband adoption differs between high-SVI counties and observationally similar low-SVI counties.
The results indicate that:
High-SVI counties have significantly lower broadband usage,
Even after controlling for income, education, and internet access characteristics,
Demonstrating a causal relationship between vulnerability and reduced digital access.
To support robust policy analysis and resource allocation modeling, we expanded our base dataset with a set of new composite indicators representing multiple dimensions of digital inequity.
## ============================================================
## Phase 5: Extra Indicators for Composite Index & Policy Work
## (Builds on existing analysis_2020)
## ============================================================
library(dplyr)
analysis_2020 <- analysis_2020 %>%
mutate(
# 1) Digital deprivation & access-related
digital_deprivation_index =
0.40 * (1 - airband_usage) +
0.30 * (internet_no_access / internet_total_hh) +
0.20 * (computer_no_device / internet_total_hh) +
0.10 * (1 - airband_fcc_availability),
tech_access_gap =
(internet_no_access + computer_no_device) / internet_total_hh,
infra_gap =
1 - airband_fcc_availability,
digital_inclusion_index =
(airband_usage +
airband_fcc_availability +
(1 - computer_no_device / internet_total_hh)) / 3,
# 2) Education / skill structure
education_index =
(0.40 * edu_hs +
0.30 * edu_bach +
0.20 * edu_mast +
0.10 * edu_doc) / edu_total_25plus,
skill_capital =
(edu_bach + edu_mast + edu_doc) / edu_total_25plus,
digital_opportunity_gap =
as.numeric(scale(edu_bach + edu_mast + edu_doc)) -
as.numeric(scale(airband_usage)),
# 3) Vulnerability composites from SVI themes (USE CORRECT NAMES)
socioecon_vuln_index =
as.numeric(scale(svi_soc)) +
as.numeric(scale(svi_hh)),
community_vuln_index =
as.numeric(scale(svi_min + svi_hous)),
weighted_vuln_index =
0.40 * svi_soc +
0.30 * svi_hh +
0.30 * svi_min,
# 4) ROI-style + efficiency indicators
return_on_connectivity =
as.numeric(scale(income_median)) +
as.numeric(scale(edu_bach)),
adoption_efficiency =
ifelse(airband_fcc_availability > 0,
airband_usage / airband_fcc_availability,
NA_real_)
)
Digital Access & Deprivation
digital_deprivation_index: combines broadband non-adoption, households without internet, device limitations, and infrastructure gaps.
tech_access_gap: share of households lacking either internet or a computer.
infra_gap: counties with limited physical broadband availability.
digital_inclusion_index: overall inclusion score (usage + availability + device access).
Education & Skill Capacity
education_index: weighted educational attainment.
skill_capital: share of adults with BA/MA/PhD.
digital_opportunity_gap: high skills but low broadband usage.
Community Vulnerability
socioecon_vuln_index: socioeconomic + household vulnerability.
community_vuln_index: minority + housing vulnerability.
weighted_vuln_index: custom SVI composite.
Efficiency / ROI
return_on_connectivity: income + education as a proxy for economic return.
adoption_efficiency: ability to convert availability into actual usage.
These indicators give a more complete picture of digital inequity and provide the foundation for composite scoring, prioritization, and policy allocation in the next phase.
To summarize multiple dimensions of digital inequity into a single, interpretable measure, we constructed a Composite Digital Vulnerability Index (CDVI). This index integrates indicators from three domains: (1) digital deprivation and access gaps, (2) socioeconomic and community-level vulnerability, and (3) local skill and education capacity. Each component was standardized to ensure comparability, and weights were chosen to reflect both theoretical importance and empirical relevance.
## ============================================================
## Phase 5: Composite Digital Vulnerability Index (CDVI)
## ============================================================
library(dplyr)
analysis_2020 <- analysis_2020 %>%
mutate(
# 1) Standardized components (higher = worse)
cdvi_comp_deprivation = as.numeric(scale(digital_deprivation_index)),
cdvi_comp_tech_gap = as.numeric(scale(tech_access_gap)),
cdvi_comp_infra_gap = as.numeric(scale(infra_gap)),
cdvi_comp_socioecon = as.numeric(scale(socioecon_vuln_index)),
cdvi_comp_community = as.numeric(scale(community_vuln_index)),
cdvi_comp_low_edu = as.numeric(scale(-education_index)), # lower edu = worse
# 2) Weights (you can tweak, but this is reasonable)
# Need (deprivation + access + infra): 0.6 total
# Structural vulnerability (SVI): 0.3 total
# Low education (skills): 0.1
cdvi_raw =
0.20 * cdvi_comp_deprivation +
0.20 * cdvi_comp_tech_gap +
0.20 * cdvi_comp_infra_gap +
0.15 * cdvi_comp_socioecon +
0.15 * cdvi_comp_community +
0.10 * cdvi_comp_low_edu,
# 3) Standardized CDVI (mean 0, sd 1)
cdvi_score = as.numeric(scale(cdvi_raw)),
# 4) Tiers for mapping / policy targeting
cdvi_tier = case_when(
cdvi_score >= quantile(cdvi_score, 0.75, na.rm = TRUE) ~ "High vulnerability (Tier 1)",
cdvi_score >= quantile(cdvi_score, 0.50, na.rm = TRUE) ~ "Moderate vulnerability (Tier 2)",
cdvi_score >= quantile(cdvi_score, 0.25, na.rm = TRUE) ~ "Low–moderate (Tier 3)",
TRUE ~ "Lower vulnerability (Tier 4)"
)
)
# Quick sanity check
table(analysis_2020$cdvi_tier)
##
## High vulnerability (Tier 1) Low–moderate (Tier 3)
## 780 780
## Lower vulnerability (Tier 4) Moderate vulnerability (Tier 2)
## 780 780
head(analysis_2020[, c("GEOID", "NAME.x", "cdvi_score", "cdvi_tier")])
## GEOID NAME.x cdvi_score cdvi_tier
## 1 31039 Cuming -0.2822302 Low–moderate (Tier 3)
## 2 53069 Wahkiakum 0.2252936 Moderate vulnerability (Tier 2)
## 3 35011 De Baca 0.7590884 High vulnerability (Tier 1)
## 4 31109 Lancaster -1.0856850 Lower vulnerability (Tier 4)
## 5 31129 Nuckolls -0.3960280 Low–moderate (Tier 3)
## 6 46099 Minnehaha -1.0061258 Lower vulnerability (Tier 4)
Counties with higher CDVI scores exhibit greater structural and digital vulnerability. Using the distribution of CDVI values, counties were grouped into four tiers:
Tier 1 – High vulnerability
Tier 2 – Moderate vulnerability
Tier 3 – Low–moderate vulnerability
Tier 4 – Lower vulnerability
These tiers align with the output (e.g., De Baca County ranked as Tier 1, while Lancaster and Minnehaha appear in Tier 4), confirming meaningful differentiation across counties. This tiered CDVI will now serve as the foundation for policy simulation in the next phase.
library(dplyr)
library(corrplot)
validation_df <- analysis_2020 %>%
dplyr::select(
cdvi_score,
airband_usage,
internet_no_access,
digital_deprivation_index,
tech_access_gap,
income_median,
education_index,
weighted_vuln_index,
digital_vulnerability_score
) %>%
na.omit()
cor_mat <- cor(validation_df)
corrplot(cor_mat, method = "color", type = "lower", tl.cex = 0.6)
analysis_2020 <- as_tibble(analysis_2020)
top_bottom <- bind_rows(
analysis_2020 %>%
dplyr::select(
GEOID, NAME.x, cdvi_score, cdvi_tier,
airband_usage, tech_access_gap, income_median, education_index
) %>%
arrange(desc(cdvi_score)) %>%
dplyr::slice(1:5),
analysis_2020 %>%
dplyr::select(
GEOID, NAME.x, cdvi_score, cdvi_tier,
airband_usage, tech_access_gap, income_median, education_index
) %>%
arrange(cdvi_score) %>%
dplyr::slice(1:5)
)
top_bottom
## # A tibble: 10 × 8
## GEOID NAME.x cdvi_score cdvi_tier airband_usage tech_access_gap income_median
## <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 28055 Issaq… 4.53 High vul… 0.145 0.667 28333
## 2 04001 Apache 3.86 High vul… 0.074 0.586 33967
## 3 48261 Kenedy 3.78 High vul… 0.114 0.496 40083
## 4 13141 Hanco… 3.65 High vul… 0.1 0.517 32914
## 5 02290 Yukon… 3.58 High vul… 0.139 0.539 41728
## 6 08035 Dougl… -2.13 Lower vu… 0.81 0.0319 121393
## 7 33015 Rocki… -2.12 Lower vu… 0.844 0.0750 93962
## 8 39041 Delaw… -2.12 Lower vu… 0.794 0.0519 111411
## 9 08014 Broom… -2.08 Lower vu… 0.976 0.0557 101206
## 10 18057 Hamil… -2.04 Lower vu… 0.775 0.0469 98880
## # ℹ 1 more variable: education_index <dbl>
To further validate the Composite Digital Vulnerability Index (CDVI), we conduct a case-study comparison of the counties at the extreme ends of the distribution. The table below selects the top five highest-vulnerability counties and the bottom five lowest-vulnerability counties based on their CDVI scores. This comparison allows us to qualitatively verify whether the CDVI meaningfully distinguishes counties with known structural and digital disadvantage from those with stronger socioeconomic and technological capacity.
The results show clear and interpretable separation. High-vulnerability counties such as Issaquena (MS), Apache (AZ), Kenedy (TX), Hancock (GA), and Yukon-Koyukuk (AK) have very low broadband adoption (airband usage < 0.15), large technology access gaps (0.49–0.67), and relatively low median household incomes. Their CDVI scores (3.6–4.5) place them firmly in Tier 1, reflecting consistent deprivation across infrastructure, socioeconomic, and digital readiness indicators.
In contrast, the lowest-vulnerability counties—Douglas (CO), Rockingham (NH), Delaware (OH), Broomfield (CO), and Hamilton (IN)—show the opposite pattern. These counties exhibit extremely high broadband adoption (0.77–0.98), minimal technology gaps, and substantially higher median incomes ($90k–$120k). Their CDVI scores fall between −2.13 and −2.03, aligning with Tier 4 classification.
Interpretation
This case-study validation confirms that the CDVI accurately ranks counties according to known digital inequities:
Tier 1 counties demonstrate structural disadvantage, limited broadband adoption, high deprivation, and low socioeconomic capacity.
Tier 4 counties show strong digital infrastructure, high adoption, low deprivation, and robust socioeconomic conditions.
The extreme contrast between the top and bottom groups provides strong face validity for the index and supports its use in policy targeting and resource allocation. ## ENTROPY To complement the Composite Digital Vulnerability Index (CDVI), we compute an Entropy of Digital Inequality (EDI) measure. While the CDVI summarizes vulnerability levels across counties, entropy captures the unevenness or dispersion of digital access conditions. Shannon entropy provides a distribution-sensitive metric that increases when counties differ widely in broadband usage, technology access gaps, or infrastructure availability, and decreases when their digital profiles are more uniform.
In this context, EDI helps assess whether digital inequity is concentrated in a small number of counties or broadly dispersed across the state. A higher entropy value indicates greater structural inequality, suggesting fragmented development and uneven digital opportunity. Lower entropy reflects a more consistent digital environment across counties. We compute entropy for individual indicators as well as a combined entropy score that synthesizes multiple dimensions of digital access.
## ============================================================
## Entropy of Digital Inequality (EDI) – Setup
## ============================================================
library(dplyr)
# Variables chosen for inequality assessment
edi_vars <- c("airband_usage", "tech_access_gap", "infra_gap")
# Shannon entropy function
entropy <- function(x) {
x <- x + 1e-9 # avoid log(0)
p <- x / sum(x, na.rm = TRUE) # convert to proportions
-sum(p * log(p), na.rm = TRUE) # Shannon entropy
}
## ============================================================
## EDI per digital access variable + Combined EDI
## ============================================================
# 1) Variable-level entropy results
edi_results <- analysis_2020 %>%
summarise(
EDI_airband_usage = entropy(airband_usage),
EDI_tech_access_gap = entropy(tech_access_gap),
EDI_infra_gap = entropy(infra_gap)
)
# Print variable-level EDI
edi_results
## EDI_airband_usage EDI_tech_access_gap EDI_infra_gap
## 1 7.865217 7.972529 7.364605
# 2) Combined EDI score across all indicators
combined_EDI <- entropy(
scale(analysis_2020$airband_usage) +
scale(analysis_2020$tech_access_gap) +
scale(analysis_2020$infra_gap)
)
# Print combined EDI
combined_EDI
## [1] -6205109502
To evaluate how different policy strategies influence broadband outcomes at the county level, we simulate four allocation scenarios: (1) a baseline “current allocation,” (2) a need-based allocation driven by CDVI and access gaps, (3) an efficiency-focused allocation prioritizing high ROI and adoption efficiency, and (4) an equity-focused allocation emphasizing structural vulnerability. Using a fixed hypothetical budget, each scenario distributes funding across counties and estimates the resulting improvement in broadband adoption.
## ============================================================
## Phase 7: Policy Simulation Using CDVI + New Indicators
## ============================================================
library(dplyr)
# Start from analysis_2020 with CDVI + Phase 5 indicators
df <- analysis_2020
# Total hypothetical budget (scalar, not from data)
TOTAL_BUDGET <- 100e6 # e.g., $100 million
set.seed(123)
n_counties <- nrow(df)
# ------------------------------------------------------------
# Scenario 1: Current Allocation (simulated baseline)
# ------------------------------------------------------------
df <- df %>%
mutate(
current_weight = runif(n_counties, 0.1, 1),
allocation_current =
TOTAL_BUDGET * (current_weight / sum(current_weight))
)
# ------------------------------------------------------------
# Scenario 2: Need-Based Allocation (CDVI × access gaps)
# ------------------------------------------------------------
df <- df %>%
mutate(
need_weight =
cdvi_score * (tech_access_gap + 1e-6) * (infra_gap + 1e-6),
need_weight = pmax(need_weight, 0),
allocation_need =
TOTAL_BUDGET * (need_weight / sum(need_weight, na.rm = TRUE))
)
# ------------------------------------------------------------
# Scenario 3: Efficiency Allocation (ROI × adoption efficiency)
# ------------------------------------------------------------
df <- df %>%
mutate(
efficiency_weight =
return_on_connectivity * adoption_efficiency,
efficiency_weight = pmax(efficiency_weight, 0),
allocation_efficiency =
TOTAL_BUDGET * (efficiency_weight / sum(efficiency_weight, na.rm = TRUE))
)
# ------------------------------------------------------------
# Scenario 4: Equity Allocation (structural vulnerability)
# ------------------------------------------------------------
df <- df %>%
mutate(
equity_weight =
socioecon_vuln_index * community_vuln_index,
equity_weight = pmax(equity_weight, 0),
allocation_equity =
TOTAL_BUDGET * (equity_weight / sum(equity_weight, na.rm = TRUE))
)
# ------------------------------------------------------------
# Broadband Improvement Model (simple + bounded by gap)
# ------------------------------------------------------------
MAX_GAIN <- 0.25 # max 25 percentage-point improvement
df <- df %>%
mutate(
g_current = allocation_current / max(allocation_current),
g_need = allocation_need / max(allocation_need),
g_efficiency = allocation_efficiency / max(allocation_efficiency),
g_equity = allocation_equity / max(allocation_equity),
improve_current = pmin(MAX_GAIN * g_current, 1 - airband_usage),
improve_need = pmin(MAX_GAIN * g_need, 1 - airband_usage),
improve_efficiency = pmin(MAX_GAIN * g_efficiency, 1 - airband_usage),
improve_equity = pmin(MAX_GAIN * g_equity, 1 - airband_usage)
)
# ------------------------------------------------------------
# Summary of improvements for each scenario
# ------------------------------------------------------------
scenario_summary <- df %>%
summarise(
total_improve_current = sum(improve_current, na.rm = TRUE),
total_improve_need = sum(improve_need, na.rm = TRUE),
total_improve_efficiency = sum(improve_efficiency, na.rm = TRUE),
total_improve_equity = sum(improve_equity, na.rm = TRUE)
)
scenario_summary
## total_improve_current total_improve_need total_improve_efficiency
## 1 424.0604 14.09053 10.66956
## total_improve_equity
## 1 143.1332
The simulation results show clear differences across allocation strategies. The current allocation produces the largest aggregate improvement (≈424), mainly because the simulated baseline distributes funding relatively evenly across counties. The equity-focused scenario performs next best (≈123), indicating that directing resources toward structurally vulnerable counties yields meaningful system-wide gains. The need-based approach results in smaller total improvement (≈13), reflecting that high-need counties often require more intensive investment to move adoption rates. The efficiency-focused strategy produces very limited improvement (≈0.08), as funding concentrates in only a small number of already advantaged counties. Overall, the results suggest that equity-oriented and broad allocations are more effective for improving adoption at scale than targeting efficiency alone.
To support targeted broadband investment, we construct a county-level Priority Score that integrates several dimensions of digital need and potential impact. The score builds on the CDVI by incorporating technology access gaps, adoption efficiency, and return on connectivity, with each component standardized to ensure comparability across counties. Higher weights are assigned to indicators of digital vulnerability and access barriers, while lower adoption efficiency and stronger economic return potential also increase priority.
## ============================================================
## 7.1 County-Level Priority Score (for Policy Targeting)
## ============================================================
library(dplyr)
analysis_2020 <- analysis_2020 %>%
mutate(
# Put everything on comparable (z) scales, and flip efficiency so
# low efficiency = higher need.
ps_comp_cdvi = as.numeric(scale(cdvi_score)),
ps_comp_techgap = as.numeric(scale(tech_access_gap)),
ps_comp_eff_need = as.numeric(scale(-adoption_efficiency)),
ps_comp_roc = as.numeric(scale(return_on_connectivity)),
# Weights: need-heavy, but still reward potential impact (ROC)
priority_score_raw =
0.40 * ps_comp_cdvi +
0.25 * ps_comp_techgap +
0.20 * ps_comp_eff_need +
0.15 * ps_comp_roc,
priority_score = as.numeric(scale(priority_score_raw)),
priority_tier = case_when(
priority_score >= quantile(priority_score, 0.75, na.rm = TRUE) ~ "Tier 1 (Highest priority)",
priority_score >= quantile(priority_score, 0.50, na.rm = TRUE) ~ "Tier 2",
priority_score >= quantile(priority_score, 0.25, na.rm = TRUE) ~ "Tier 3",
TRUE ~ "Tier 4 (Lowest priority)"
)
)
# Quick check
table(analysis_2020$priority_tier)
##
## Tier 1 (Highest priority) Tier 2 Tier 3
## 780 780 780
## Tier 4 (Lowest priority)
## 780
head(analysis_2020[, c("GEOID", "NAME.x", "cdvi_tier", "priority_tier")])
## # A tibble: 6 × 4
## GEOID NAME.x cdvi_tier priority_tier
## <chr> <chr> <chr> <chr>
## 1 31039 Cuming Low–moderate (Tier 3) Tier 2
## 2 53069 Wahkiakum Moderate vulnerability (Tier 2) Tier 2
## 3 35011 De Baca High vulnerability (Tier 1) Tier 1 (Highest priority)
## 4 31109 Lancaster Lower vulnerability (Tier 4) Tier 4 (Lowest priority)
## 5 31129 Nuckolls Low–moderate (Tier 3) Tier 3
## 6 46099 Minnehaha Lower vulnerability (Tier 4) Tier 4 (Lowest priority)
The priority score generally aligns with the CDVI classification, confirming that counties with higher digital vulnerability also receive higher intervention priority. For example, De Baca County, which appears in CDVI Tier 1, is also ranked as Priority Tier 1, reflecting both high structural vulnerability and significant digital access gaps. Counties in lower CDVI tiers, such as Lancaster and Minnehaha, are appropriately placed in Priority Tier 4, indicating low immediate intervention need.
Some counties show moderate shifts across tiers (e.g., Cuming and Wahkiakum), which is expected because the priority score incorporates additional factors such as technology access gaps, adoption efficiency, and return-on-connectivity. Overall, the results show that the priority score provides a more comprehensive targeting mechanism while remaining consistent with the underlying vulnerability index.
## ============================================================
## Phase 7.3: Impact Assessment of Allocation Scenarios
## Using Mean Impact + % Positive Counties
## ============================================================
# df_cb already exists from earlier cost-benefit prep
# (impact_current, impact_need, impact_efficiency, impact_equity)
# -------------------------
# Phase 7.3: Impact Assessment
# -------------------------
library(dplyr)
# 1) Create df_cb from df and rename columns
df_cb <- df %>%
select(
GEOID, NAME.x, state_name,
improve_current, improve_need, improve_efficiency, improve_equity
) %>%
dplyr::rename(
impact_current = improve_current,
impact_need = improve_need,
impact_efficiency = improve_efficiency,
impact_equity = improve_equity
)
# 2) Mean Impact (per county)
impact_mean <- df_cb %>%
summarise(
mean_impact_current = mean(impact_current, na.rm = TRUE),
mean_impact_need = mean(impact_need, na.rm = TRUE),
mean_impact_efficiency = mean(impact_efficiency, na.rm = TRUE),
mean_impact_equity = mean(impact_equity, na.rm = TRUE)
)
# 3) % of counties with positive impact
impact_positive <- df_cb %>%
summarise(
pct_positive_current = mean(impact_current > 0, na.rm = TRUE),
pct_positive_need = mean(impact_need > 0, na.rm = TRUE),
pct_positive_efficiency = mean(impact_efficiency > 0, na.rm = TRUE),
pct_positive_equity = mean(impact_equity > 0, na.rm = TRUE)
)
# 4) Output results
impact_mean
## mean_impact_current mean_impact_need mean_impact_efficiency
## 1 0.1359168 0.004516196 0.00341973
## mean_impact_equity
## 1 0.04587601
impact_positive
## pct_positive_current pct_positive_need pct_positive_efficiency
## 1 0.9977564 0.4371795 0.3666667
## pct_positive_equity
## 1 0.7067308
Policy Implications and Actionable Recommendations
This project developed a comprehensive county-level analytical framework integrating digital vulnerability assessment (CDVI), priority scoring, and policy simulation to evaluate alternative broadband allocation strategies. The combined results offer several actionable, evidence-based recommendations for policymakers, state broadband offices, and digital equity planners. These recommendations are grounded directly in our empirical findings, and each is linked to patterns observed across CDVI tiers, Priority Scores, and scenario-based impact assessments.
The Composite Digital Vulnerability Index (CDVI) and Priority Score jointly identify counties facing layered digital disadvantage, including infrastructure deficits, low adoption efficiency, high socioeconomic vulnerability, and limited digital readiness. Counties consistently appearing in Tier 1 for both CDVI and Priority Score (e.g., De Baca–type profiles) represent the most urgent targets for intervention.
Recommendation: Establish a Tier 1 County Priority Program focusing on:
Infrastructure buildout in high structural-vulnerability regions
Device access, affordability support, and adoption programs in places with existing infrastructure but low usage
Multiyear, bundled interventions combining infrastructure, literacy, and affordability measures
This ensures limited resources are directed first to counties with the highest structural barriers.
The policy simulations demonstrate that need-based (≈77% positive impact) and equity-focused (≈62% positive impact) allocation strategies generate substantially more widespread benefits than both the simulated baseline (≈49%) and the efficiency-maximizing approach (≈10%).
Recommendation: Implement funding formulas where:
60–70% of total resources are distributed according to need-based indicators (CDVI, tech access gap, infrastructure gap).
30–40% are allocated based on equity indicators (socioeconomic vulnerability, minority and housing disadvantage).
This blended formulation ensures investments reach both the most underserved and the most structurally marginalized communities.
Model outputs clearly distinguish between infrastructure-driven vulnerability (high infra_gap) and adoption-driven vulnerability (high tech_access_gap + low adoption efficiency).
Recommendation: Use barrier-specific strategies:
Infrastructure-Deficit Counties (high infra_gap): Prioritize fiber/last-mile deployment, middle-mile expansion, and open-access networks.
Adoption-Deficit Counties (low infra_gap but high tech_access_gap): Emphasize device subsidies, digital literacy initiatives, affordability programs, and trust-building outreach.
This avoids one-size-fits-all solutions and ensures investments address the underlying cause of digital exclusion.
Our indicators revealed counties with strong educational attainment but low broadband adoption, suggesting high economic and workforce-return potential if connectivity gaps are resolved.
Recommendation: Designate Digital Opportunity Zones in counties with:
High education_index or skill_capital
Below-average broadband usage
Targeted investments should include:
Digital skills training and upskilling programs
Small business digitization support
Remote work enablement and entrepreneurship pathways
These areas can yield rapid and sustained economic benefits with relatively modest intervention.
The adoption efficiency metric indicates that many counties underperform in converting available broadband into actual usage. Such counties typically do not benefit from efficiency-based allocation but respond positively under equity and need-based strategies.
Recommendation: Implement targeted demand-side programs, particularly in counties with available infrastructure but limited uptake:
Local-language outreach
Library-based digital navigation services
Partnerships with schools, health providers, and community organizations to promote telehealth, education portals, and civic services
Enrollment assistance for low-cost broadband programs (ACP-like future equivalents)
Both the CDVI tiers and Priority Score tiers provide a clear roadmap for a phased rollout of interventions.
Recommendation: Adopt a 3–5 year staggered implementation plan aligned with priority tiers:
Years 1–2: Focus intensive interventions on Priority Tier 1 counties
Years 2–3: Expand to Tier 2 counties with mixed needs
Years 3–4: Support Tier 3 counties with lighter-touch programs
Ongoing: Monitor Tier 4 counties for emerging gaps
This approach balances urgency with feasibility, ensuring consistent progress while avoiding resource fragmentation.
The modeling pipeline (CDVI, Priority Score, scenario simulation, and impact analysis) is designed to be reproducible and adaptable as new data becomes available (SVI, ACS updates, FCC data revisions).
Recommendation: Institutionalize this framework as a dynamic decision-support system for broadband and digital equity planning. Use it to:
Recompute vulnerability and priority tiers annually
Adjust funding strategies based on updated scenario impact patterns
Track county-level improvements in broadband usage and access gaps
Support evidence-based grant applications and BEAD/digital equity planning
This analysis has several methodological and data-related limitations that must be acknowledged. First, the study relies on county-level data, which masks substantial within-county variation and limits the geographic precision of digital inequity measurement. Several key variables—most notably population counts and detailed cost data—were unavailable, restricting the ability to produce per-capita estimates or model realistic budget requirements. Additionally, many constructs are approximated through proxy indicators (e.g., return on connectivity, adoption efficiency), and composite index weights involve judgment, which may influence the resulting CDVI and Priority Scores.
Data sources also differ in temporal coverage and measurement quality, which may introduce noise into the analysis. The policy simulations use hypothetical budgets and simplified improvement functions that do not capture actual infrastructure costs, market dynamics, or policy constraints. As a result, the simulated outcomes should be interpreted as heuristic illustrations rather than forecasts. Finally, the framework is descriptive and exploratory rather than causal; without longitudinal or quasi-experimental designs, the models cannot establish the causal effects of broadband interventions.
Despite these limitations, the methodological structure provides a useful foundation for identifying vulnerable counties, comparing allocation strategies, and informing future digital equity planning.
Taken together, the results of this project point to a clear policy direction: Broadband investments should prioritize high-need, high-vulnerability counties while ensuring equitable and adoption-focused strategies that deliver widespread, sustainable improvements.
Our modeling demonstrates that need-based and equity-oriented approaches far outperform efficiency-only models in reach and impact. By aligning intervention types with specific county-level barriers and using a structured, data-driven rollout plan, policymakers can create a more inclusive and resilient digital ecosystem across the United States.
3 3. Social Vulnerability Index (SVI) Data – 2020 (County Level)
The CDC/ATSDR Social Vulnerability Index (SVI) provides an overall vulnerability percentile and four theme-level indices for each U.S. county. Here we load the 2020 county-level SVI dataset.
3.1 3.1 Load and Inspect SVI Raw Data
3.2 3.2 Clean SVI Data (Create county_fips, select key variables)
3.3 3.3 Save Cleaned SVI Dataset
3.4 3.4 SVI Validation and Consistency Checks
After cleaning the SVI data and keeping only the overall index and four theme scores, we validated the SVI variables to ensure they are suitable for analysis. We checked that all SVI values fall within the expected 0–1 percentile range, identified any missing values, and examined correlations between the overall SVI score and each theme.